In this post I'm going to explain how to install Apache Hive and walk through some first steps.
Download
First of all we have to download everything: Hive and Hadoop (Hive will not work without it). For me, the closest download mirrors were:
http://ftp.man.poznan.pl/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
http://ftp.man.poznan.pl/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
If you don't have Java installed, install it as well (the examples below use OpenJDK 8).
Unpack and env setup
Now we have to unpack the downloaded archives somewhere. I unpacked them into these folders:
/home/dmitry/Develop/hadoop
/home/dmitry/Develop/hive
Next we have to create a folder for the Hive warehouse:
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod a+rwx /user/hive
$ sudo chmod a+rwx /user/hive/warehouse
To set up the environment variables, we can create a file hadoop-env.sh:
#!/bin/sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/home/dmitry/Develop/hadoop
export HIVE_HOME=/home/dmitry/Develop/hive
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
Please make sure JAVA_HOME points to proper location.
To put all these variables into current session, run in terminal:
$ source ./hadoop-env.sh
Another option is to add everything (the content of the file above) to $HOME/.bashrc.
Checking if everything is configured properly
Actually, Hadoop is not configured and the HDFS daemons are not started, but this is not needed for Hive: with the default configuration, hdfs commands simply operate on the local filesystem. We can check it this way:
Hadoop test:
dmitry@dmitry-ThinkPad-X220:~/Develop$ hdfs dfs -ls /home
Found 3 items
-rw-r--r-- 1 root root 31 2018-04-09 19:28 /home/.directory
drwxr-xr-x - dmitry dmitry 4096 2019-09-06 21:58 /home/dmitry
drwx------ - root root 16384 2019-08-29 21:11 /home/lost+found
It shows my local files, so it works.
Hive test:
First we have to initialize the metastore database:
dmitry@dmitry-ThinkPad-X220:~/Develop$ cd $HIVE_HOME
dmitry@dmitry-ThinkPad-X220:~/Develop/hive$ schematool -initSchema -dbType derby
At the end of execution it should print:
Initialization script completed
schemaTool completed
And now we can run hive CLI and try to create a table:
dmitry@dmitry-ThinkPad-X220:~/Develop$ hive
Hive Session ID = a62b7d5e-1955-483d-973a-f7416626ebf8
Logging initialized using configuration in jar:file:/home/dmitry/Develop/hive-3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
If everything is ok it should be possible to create a table:
hive> CREATE TABLE TEST(ID INT);
OK
Time taken: 0.233 seconds
hive>
Now we can try to insert a new value into the just-created table:
hive> insert into test values(1);
And checking:
hive> select * from test;
OK
1
The value was returned, but as you can see, the column header is disabled by default. Let's enable it:
hive> set hive.cli.print.header=true;
hive> select * from test;
OK
test.id
1
If you prefer to always see the headers, put the set command in your $HOME/.hiverc file.
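For example, the file can be created from the shell; the Hive CLI executes the commands in $HOME/.hiverc on every startup:

```shell
# Append the setting to $HOME/.hiverc; the Hive CLI runs the
# commands in this file each time it starts.
printf 'set hive.cli.print.header=true;\n' >> "$HOME/.hiverc"
cat "$HOME/.hiverc"
```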
Work with databases
By default, Hive uses the "default" database:
hive> show databases;
OK
database_name
default
But we can always create our own if needed:
hive> create database mydb comment 'my test db';
Now we have:
hive> show databases;
OK
database_name
default
mydb
Time taken: 0.027 seconds, Fetched: 2 row(s)
To get the details:
hive> describe database mydb;
OK
db_name comment location owner_name owner_type parameters
mydb my test db file:/user/hive/warehouse/mydb.db dmitry USER
Time taken: 0.036 seconds, Fetched: 1 row(s)
To switch database we can use:
hive> use mydb;
OK
Time taken: 0.031 seconds
To check tables:
hive> show tables;
OK
tab_name
Time taken: 0.039 seconds
We can also configure Hive to print the current database name in the prompt:
hive> set hive.cli.print.current.db=true;
hive (mydb)>
- and instead of "hive" the prompt will show "hive (database name)".
Work with tables: Internal (Managed) Tables
The tables we have created so far are called managed tables or sometimes internal tables, because Hive controls the lifecycle of their data (more or less). As we've seen, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default. When we drop a managed table, Hive deletes the data in the table.
The syntax of table creation is similar to regular databases:
hive (mydb)> CREATE TABLE IF NOT EXISTS users (
> user_name STRING COMMENT 'Name',
> user_roles ARRAY<STRING> COMMENT 'Roles',
> user_address STRUCT<city:STRING, street:STRING, zip:INT> COMMENT 'Address'
> )
> COMMENT 'My Table'
> TBLPROPERTIES ('creator'='Dmitry');
Work with tables: External Tables
The EXTERNAL keyword tells Hive this table is external and the LOCATION ... clause is
required to tell Hive where it’s located. Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.
Let's say we have a folder:
/user/hive/groups
We can create a CSV file there with the following content (the name of the file doesn't matter):
1,Sysdba
2,Dev
3,Others
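The folder and the sample file can be created from the shell; the snippet below uses a /tmp path as a stand-in (for the real table, put the file under /user/hive/groups, which may require sudo just like the warehouse folder above):

```shell
# Create a demo folder and a comma-delimited data file in it.
mkdir -p /tmp/hive-groups
printf '1,Sysdba\n2,Dev\n3,Others\n' > /tmp/hive-groups/groups.csv
cat /tmp/hive-groups/groups.csv
```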
Now we can create an external table on top of this folder:
CREATE EXTERNAL TABLE IF NOT EXISTS groups (
group_id INT,
group_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/groups';
Let's check it:
hive (mydb)> select * from groups;
OK
groups.group_id groups.group_name
1 Sysdba
2 Dev
3 Others
Time taken: 0.133 seconds, Fetched: 3 row(s)
To tell which is which (internal table or external), we can use the "describe extended" command:
hive (mydb)> describe extended groups;
OK
col_name data_type comment
group_id int
group_name string
Detailed Table Information Table(tableName:groups, dbName:mydb, owner:dmitry, createTime:1568231912, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:group_id, type:int, comment:null), FieldSchema(name:group_name, type:string, comment:null)], location:file:/user/hive/groups, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1568231912, bucketing_version=2, totalSize=44, EXTERNAL=TRUE, numFiles=1}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER)
Time taken: 0.096 seconds, Fetched: 4 row(s)
From this output we can see: tableType:EXTERNAL_TABLE
Partitioning
To create a partitioned table, we just have to add PARTITIONED BY with the list of partition columns to the table definition. Let's drop the previously created users table and re-create it partitioned by department:
CREATE TABLE IF NOT EXISTS users (
user_name STRING COMMENT 'Name',
user_roles ARRAY<STRING> COMMENT 'Roles',
user_address STRUCT<city:STRING, street:STRING, zip:INT> COMMENT 'Address'
)
PARTITIONED BY (user_department STRING);
Now let's add some data:
hive (mydb)> insert into users (user_name, user_department) values("joe","AAA");
hive (mydb)> insert into users (user_name, user_department) values("moe","BBB");
And let's now check what we have in file system:
dmitry@dmitry-ThinkPad-X220:/user/hive/warehouse/mydb.db/users$ pwd
/user/hive/warehouse/mydb.db/users
dmitry@dmitry-ThinkPad-X220:/user/hive/warehouse/mydb.db/users$ ll
total 16
drwxr-xr-x 4 dmitry dmitry 4096 wrz 12 21:07 ./
drwxr-xr-x 3 dmitry dmitry 4096 wrz 12 21:04 ../
drwxr-xr-x 2 dmitry dmitry 4096 wrz 12 21:07 'user_department=AAA'/
drwxr-xr-x 2 dmitry dmitry 4096 wrz 12 21:07 'user_department=BBB'/
Values with different departments are located in different folders.
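At the filesystem level the layout is plain directories: each distinct value of the partition column becomes a subdirectory named column=value, and the data files inside it contain only the non-partition columns. A minimal simulation of that layout under a hypothetical /tmp path (not produced by Hive itself):

```shell
# Simulate Hive's partition directory layout under a demo path.
base=/tmp/warehouse-demo/mydb.db/users
mkdir -p "$base/user_department=AAA" "$base/user_department=BBB"
# The partition value lives in the directory name, not in the file:
printf 'joe\n' > "$base/user_department=AAA/000000_0"
printf 'moe\n' > "$base/user_department=BBB/000000_0"
find "$base" | sort
```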
Exporting data
To export data from tables, we can use the following syntax:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/dmitry/hive-export' SELECT * FROM GROUPS;
Now we can check the output directly from Hive:
hive> ! less /home/dmitry/hive-export/000000_0;
1Sysdba
2Dev
3Others
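The values look glued together ("1Sysdba") because Hive's default field delimiter in exported files is the non-printing Ctrl-A character (\001). A local demonstration with a stand-in file (not the real 000000_0):

```shell
# Stand-in for an exported file, fields separated by Ctrl-A (\001).
printf '1\001Sysdba\n2\001Dev\n3\001Others\n' > /tmp/hive-export-demo
cat -v /tmp/hive-export-demo             # shows the hidden delimiter as ^A
tr '\001' ',' < /tmp/hive-export-demo    # re-delimit with commas for readability
```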
Loading data
To load data back into a table, we can use the following syntax:
hive> LOAD DATA LOCAL INPATH '/home/dmitry/hive-export' OVERWRITE INTO TABLE GROUPS;
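For a local-filesystem warehouse like ours, LOAD DATA ... OVERWRITE essentially replaces the table directory's data files with the files from the given path. A rough filesystem-level sketch of that behavior (demo paths only, not Hive itself):

```shell
# Rough sketch of what LOAD DATA ... OVERWRITE does at the file level.
src=/tmp/load-demo-src
table=/tmp/load-demo-table
mkdir -p "$src" "$table"
printf 'old\n'      > "$table/000000_0"  # pre-existing table data
printf '1,Sysdba\n' > "$src/data.csv"    # file being loaded
rm -f "$table"/*                         # OVERWRITE clears old data files
cp "$src"/* "$table"/                    # then copies the new files in
ls "$table"
```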