In this post I'm going to explain how to install Apache Hive and walk through some first steps.
Download
First of all we have to download everything: Hive and Hadoop (Hive will not work without it). For me, the closest download mirrors were:
http://ftp.man.poznan.pl/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
http://ftp.man.poznan.pl/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
If you don't have Java installed, install it as well (the examples below use OpenJDK 8).
Unpack and env setup
Now we have to unpack the downloaded archives somewhere. I unpacked them into these folders:
/home/dmitry/Develop/hadoop
/home/dmitry/Develop/hive
Next we have to create a folder for the Hive warehouse:
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod a+rwx /user/hive
$ sudo chmod a+rwx /user/hive/warehouse
To set up the environment variables, we can create a file hadoop-env.sh:
#!/bin/sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/home/dmitry/Develop/hadoop
export HIVE_HOME=/home/dmitry/Develop/hive
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
Please make sure JAVA_HOME points to proper location.
To put all these variables into current session, run in terminal:
$ source ./hadoop-env.sh
Another option is to add everything (the content of the file above) to $HOME/.bashrc.
Checking if everything is configured properly
Actually, Hadoop is not configured and the HDFS daemons are not started, but this is not needed for Hive: with the default configuration, hdfs commands simply operate on the local filesystem. We can check it this way:
Hadoop test:
dmitry@dmitry-ThinkPad-X220:~/Develop$ hdfs dfs -ls /home
Found 3 items
-rw-r--r-- 1 root root 31 2018-04-09 19:28 /home/.directory
drwxr-xr-x - dmitry dmitry 4096 2019-09-06 21:58 /home/dmitry
drwx------ - root root 16384 2019-08-29 21:11 /home/lost+found
It shows my local files, so it works.
Hive test:
First we have to initialize the metastore database:
dmitry@dmitry-ThinkPad-X220:~/Develop$ cd $HIVE_HOME
dmitry@dmitry-ThinkPad-X220:~/Develop/hive$ schematool -initSchema -dbType derby
At the end of execution it should print:
Initialization script completed
schemaTool completed
And now we can run hive CLI and try to create a table:
dmitry@dmitry-ThinkPad-X220:~/Develop$ hive
Hive Session ID = a62b7d5e-1955-483d-973a-f7416626ebf8
Logging initialized using configuration in jar:file:/home/dmitry/Develop/hive-3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
If everything is ok it should be possible to create a table:
hive> CREATE TABLE TEST(ID INT);
OK
Time taken: 0.233 seconds
hive>
Now we can try to insert a new value into the just-created table:
hive> insert into test values(1);
And checking:
hive> select * from test;
OK
1
The value was returned, but as you can see, the column header is disabled by default. Let's enable it:
hive> set hive.cli.print.header=true;
hive> select * from test;
OK
test.id
1
If you prefer to always see the headers, put the set command in your $HOME/.hiverc file.
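For example, the file can be created from the shell; the Hive CLI executes the commands in $HOME/.hiverc on every startup:

```shell
# Append the setting to $HOME/.hiverc; the Hive CLI runs the
# commands in this file each time it starts.
printf 'set hive.cli.print.header=true;\n' >> "$HOME/.hiverc"
cat "$HOME/.hiverc"
```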
Work with databases
By default, Hive uses the "default" database:
hive> show databases;
OK
database_name
default
But we can always create our own if needed:
hive> create database mydb comment 'my test db';
Now we have:
hive> show databases;
OK
database_name
default
mydb
Time taken: 0.027 seconds, Fetched: 2 row(s)
To get the details:
hive> describe database mydb;
OK
db_name comment location owner_name owner_type parameters
mydb my test db file:/user/hive/warehouse/mydb.db dmitry USER
Time taken: 0.036 seconds, Fetched: 1 row(s)
To switch database we can use:
hive> use mydb;
OK
Time taken: 0.031 seconds
To check tables:
hive> show tables;
OK
tab_name
Time taken: 0.039 seconds
We can also configure Hive to print the current database name in the prompt:
hive> set hive.cli.print.current.db=true;
hive (mydb)>
- and instead of "hive" the prompt will show "hive (database name)".
Work with tables: Internal (Managed) Tables
The tables we have created so far are called managed tables or sometimes internal tables, because Hive controls the lifecycle of their data (more or less). As we've seen, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default. When we drop a managed table, Hive deletes the data in the table.
The syntax of table creation is similar to regular databases:
hive (mydb)> CREATE TABLE IF NOT EXISTS users (
> user_name STRING COMMENT 'Name',
> user_roles ARRAY<STRING> COMMENT 'Roles',
> user_address STRUCT<city:STRING, street:STRING, zip:INT> COMMENT 'Address'
> )
> COMMENT 'My Table'
> TBLPROPERTIES ('creator'='Dmitry');
Work with tables: External Tables
The EXTERNAL keyword tells Hive this table is external and the LOCATION ... clause is
required to tell Hive where it’s located. Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.
Let's say we have a folder:
/user/hive/groups
We can create a CSV file there with the following content (the name of the file doesn't matter):
1,Sysdba
2,Dev
3,Others
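The folder and the sample file can be created from the shell; the snippet below uses a /tmp path as a stand-in (for the real table, put the file under /user/hive/groups, which may require sudo just like the warehouse folder above):

```shell
# Create a demo folder and a comma-delimited data file in it.
mkdir -p /tmp/hive-groups
printf '1,Sysdba\n2,Dev\n3,Others\n' > /tmp/hive-groups/groups.csv
cat /tmp/hive-groups/groups.csv
```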
Now we can create an external table on top of this folder:
CREATE EXTERNAL TABLE IF NOT EXISTS groups (
group_id INT,
group_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/groups';
Let's check it:
hive (mydb)> select * from groups;
OK
groups.group_id groups.group_name
1 Sysdba
2 Dev
3 Others
Time taken: 0.133 seconds, Fetched: 3 row(s)
To tell which is which (internal table or external), we can use the "describe extended" command:
hive (mydb)> describe extended groups;
OK
col_name data_type comment
group_id int
group_name string
Detailed Table Information Table(tableName:groups, dbName:mydb, owner:dmitry, createTime:1568231912, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:group_id, type:int, comment:null), FieldSchema(name:group_name, type:string, comment:null)], location:file:/user/hive/groups, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1568231912, bucketing_version=2, totalSize=44, EXTERNAL=TRUE, numFiles=1}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER)
Time taken: 0.096 seconds, Fetched: 4 row(s)
From this output we can see: tableType:EXTERNAL_TABLE
Partitioning
To create a partitioned table, we just have to add PARTITIONED BY with the list of partition columns to the table definition. Let's drop the previously created users table and re-create it partitioned by department:
CREATE TABLE IF NOT EXISTS users (
user_name STRING COMMENT 'Name',
user_roles ARRAY<STRING> COMMENT 'Roles',
user_address STRUCT<city:STRING, street:STRING, zip:INT> COMMENT 'Address'
)
PARTITIONED BY (user_department STRING);
Now let's add some data:
hive (mydb)> insert into users (user_name, user_department) values("joe","AAA");
hive (mydb)> insert into users (user_name, user_department) values("moe","BBB");
And let's now check what we have in file system:
dmitry@dmitry-ThinkPad-X220:/user/hive/warehouse/mydb.db/users$ pwd
/user/hive/warehouse/mydb.db/users
dmitry@dmitry-ThinkPad-X220:/user/hive/warehouse/mydb.db/users$ ll
total 16
drwxr-xr-x 4 dmitry dmitry 4096 wrz 12 21:07 ./
drwxr-xr-x 3 dmitry dmitry 4096 wrz 12 21:04 ../
drwxr-xr-x 2 dmitry dmitry 4096 wrz 12 21:07 'user_department=AAA'/
drwxr-xr-x 2 dmitry dmitry 4096 wrz 12 21:07 'user_department=BBB'/
Values with different departments are located in different folders.
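At the filesystem level the layout is plain directories: each distinct value of the partition column becomes a subdirectory named column=value, and the data files inside it contain only the non-partition columns. A minimal simulation of that layout under a hypothetical /tmp path (not produced by Hive itself):

```shell
# Simulate Hive's partition directory layout under a demo path.
base=/tmp/warehouse-demo/mydb.db/users
mkdir -p "$base/user_department=AAA" "$base/user_department=BBB"
# The partition value lives in the directory name, not in the file:
printf 'joe\n' > "$base/user_department=AAA/000000_0"
printf 'moe\n' > "$base/user_department=BBB/000000_0"
find "$base" | sort
```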
Exporting data
To export data from tables, we can use the following syntax:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/dmitry/hive-export' SELECT * FROM GROUPS;
Now we can check the output directly from Hive:
hive> ! less /home/dmitry/hive-export/000000_0;
1Sysdba
2Dev
3Others
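The values look glued together ("1Sysdba") because Hive's default field delimiter in exported files is the non-printing Ctrl-A character (\001). A local demonstration with a stand-in file (not the real 000000_0):

```shell
# Stand-in for an exported file, fields separated by Ctrl-A (\001).
printf '1\001Sysdba\n2\001Dev\n3\001Others\n' > /tmp/hive-export-demo
cat -v /tmp/hive-export-demo             # shows the hidden delimiter as ^A
tr '\001' ',' < /tmp/hive-export-demo    # re-delimit with commas for readability
```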
Loading data
To load data back into a table, we can use the following syntax:
hive> LOAD DATA LOCAL INPATH '/home/dmitry/hive-export' OVERWRITE INTO TABLE GROUPS;
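For a local-filesystem warehouse like ours, LOAD DATA ... OVERWRITE essentially replaces the table directory's data files with the files from the given path. A rough filesystem-level sketch of that behavior (demo paths only, not Hive itself):

```shell
# Rough sketch of what LOAD DATA ... OVERWRITE does at the file level.
src=/tmp/load-demo-src
table=/tmp/load-demo-table
mkdir -p "$src" "$table"
printf 'old\n'      > "$table/000000_0"  # pre-existing table data
printf '1,Sysdba\n' > "$src/data.csv"    # file being loaded
rm -f "$table"/*                         # OVERWRITE clears old data files
cp "$src"/* "$table"/                    # then copies the new files in
ls "$table"
```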