Monday, September 23, 2019

Apache Hive getting started


In this post I'm going to explain installation of apache hive and some first steps.

Download

Of course first we have to download everything: hive and hadoop (hive will not work without it). For me, closest locations for download were:

http://ftp.man.poznan.pl/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
http://ftp.man.poznan.pl/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

If you don't have java installed - you should install it as well.

Unpack and env setup

Now we have to unpack downloaded archives somewere. I unpacked them into folders:
/home/dmitry/Develop/hadoop
/home/dmitry/Develop/hive

Next we have to create a folder for hive warehouse:
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod a+rwx /user/hive
$ sudo chmod a+rwx /user/hive/warehouse


To setup env settings we can create a file hadoop-env.sh: 

#!/bin/sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/home/dmitry/Develop/hadoop
export HIVE_HOME=/home/dmitry/Develop/hive

export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin

Please make sure JAVA_HOME points to proper location. 

To put all these variables into current session, run in terminal:
$source ./hadoop-env.sh

Another option is to add everything(content of file from above) into  $HOME/.bashrc


Checking if everything configured properly

Actually hadoop is not confugured and hdfs daemons are not started but we this is not neede for hive. We can check it this way:

Hadoop test: 
dmitry@dmitry-ThinkPad-X220:~/Develop$ hdfs dfs -ls /home
Found 3 items                                                                                                                         
-rw-r--r--   1 root   root           31 2018-04-09 19:28 /home/.directory                                                            
drwxr-xr-x   - dmitry dmitry       4096 2019-09-06 21:58 /home/dmitry                                                                 
drwx------   - root   root        16384 2019-08-29 21:11 /home/lost+found 

If it shows my files, so it works. 

Hive test:

First we have to init meta-data database:
dmitry@dmitry-ThinkPad-X220:~/Develop$ cd $HIVE_HOME
dmitry@dmitry-ThinkPad-X220:~/Develop/hive$ schematool -initSchema -dbType derby

At the end of execution it should print:
Initialization script completed
schemaTool completed

And now we can run hive CLI and try to create a table:


dmitry@dmitry-ThinkPad-X220:~/Develop$ hive
                   
Hive Session ID = a62b7d5e-1955-483d-973a-f7416626ebf8                                                                              
                                                                                                                                      
Logging initialized using configuration in jar:file:/home/dmitry/Develop/hive-3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> 

If everything is ok it should be possible to create a table:
hive> CREATE TABLE TEST(ID INT);
OK
Time taken: 0.233 seconds
hive>

Now we can try to insert new values into just created table:
insert into test values(1);

And checking:
hive> select * from test;
OK
1

Value was returned, but as you can see column header is disabled by default. Let's enable it:
hive>  set hive.cli.print.header=true;
hive> select * from test;
OK
test.id
1

If you always prefer seeing the headers, put the first line in your $HOME/.hiverc file.


Work with databases

By default, hive use "default" database:
hive> show databases;
OK
database_name
default

But can always create our own if its needed:
create database mydb comment 'my test db';

now we have : 

hive> show databases;
OK
database_name
default
mydb
Time taken: 0.027 seconds, Fetched: 2 row(s)

To get the details: 
hive> describe database mydb;
OK
db_name comment location        owner_name      owner_type      parameters
mydb    my test db      file:/user/hive/warehouse/mydb.db       dmitry  USER
Time taken: 0.036 seconds, Fetched: 1 row(s)

To switch database we can use:
hive> use mydb;
OK
Time taken: 0.031 seconds

To check tables:
hive> show tables;
OK
tab_name
Time taken: 0.039 seconds


Also  we can configure hive to print database name in prompt:
hive>  set hive.cli.print.current.db=true;
hive (mydb)> 
- and instead of "hive" we will have "hive (database name)"


Work with tables:  Internal(Managed) Tables

Managed Tables
The tables we have created so far are called managed tables or sometimes called inter-
nal tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen,
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
When we drop a managed table, Hive deletes
the data in the table.

Syntax of table creation is similar to regular databases:
hive (mydb)> CREATE TABLE IF NOT EXISTS users (
           >   user_name  STRING COMMENT 'Name',
           >   user_roles  ARRAY<STRING> COMMENT 'Roles',
           >   user_address STRUCT<city:STRING, street:STRING, zip:INT> COMMENT 'Address'
           > )
           > COMMENT 'My Table'
           > TBLPROPERTIES ('creator'='Dmitry');


Work with tables:  External Tables

The EXTERNAL keyword tells Hive this table is external and the LOCATION ... clause is
required to tell Hive where it’s located. Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.

Let's say we have a folder:
/user/hive/groups

We can create a csv file here with content(name of file doesn't matter):
1,Sysdba
2,Dev
3,Others

And now we can create an external table:

CREATE EXTERNAL TABLE IF NOT EXISTS groups (
group_id INT,
group_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/groups';

Let's check it:
hive (mydb)> select * from groups;
OK
groups.group_id groups.group_name
1       Sysdba
2       Dev
3       Others
Time taken: 0.133 seconds, Fetched: 3 row(s)

To understand "who is who" (internal table or external) we can use command "describe extended":

hive (mydb)> describe extended groups;
OK
col_name        data_type       comment
group_id                int                                         
group_name              string                                      
                 
Detailed Table Information      Table(tableName:groups, dbName:mydb, owner:dmitry, createTime:1568231912, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:group_id, type:int, comment:null), FieldSchema(name:group_name, type:string, comment:null)], location:file:/user/hive/groups, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1568231912, bucketing_version=2, totalSize=44, EXTERNAL=TRUE, numFiles=1}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER)
Time taken: 0.096 seconds, Fetched: 4 row(s)

From this output we can see: tableType:EXTERNAL_TABLE


Partitioning

To create partitioned table we just have to add to table definition PARTITIONED BY and list columns. Let's drop previously creater table users and re-create it partitioned by department:

CREATE TABLE IF NOT EXISTS users (
 user_name  STRING COMMENT 'Name',
 user_roles  ARRAY<STRING> COMMENT 'Roles',
user_address STRUCT<city:STRING, street:STRING, zip:INT> COMMENT 'Address'
)
PARTITIONED BY (user_department STRING);


Now let's add some data:
hive (mydb)> insert into users (user_name, user_department) values("joe","AAA");
hive (mydb)> insert into users (user_name, user_department) values("moe","BBB");


And let's now check what we have in file system:
dmitry@dmitry-ThinkPad-X220:/user/hive/warehouse/mydb.db/users$ pwd
/user/hive/warehouse/mydb.db/users
dmitry@dmitry-ThinkPad-X220:/user/hive/warehouse/mydb.db/users$ ll
total 16
drwxr-xr-x 4 dmitry dmitry 4096 wrz 12 21:07  ./
drwxr-xr-x 3 dmitry dmitry 4096 wrz 12 21:04  ../
drwxr-xr-x 2 dmitry dmitry 4096 wrz 12 21:07 'user_department=AAA'/
drwxr-xr-x 2 dmitry dmitry 4096 wrz 12 21:07 'user_department=BBB'/

Values with different department are located in different folders.



Exporting data

To export data from tables we can use next syntax:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/dmitry/hive-export' SELECT * FROM GROUPS;

Now we can check output directly from hive:
hive> ! less /home/dmitry/hive-export/000000_0;
1Sysdba
2Dev
3Others


Loading data

To load data we can use next syntax:
LOAD DATA LOCAL INPATH '/home/dmitry/hive-export' OVERWRITE INTO TABLE GROUPS;


The end

And this is basically the end of this post:)