Insert large data into Hive dynamic partitions

A dynamically partitioned Hive table lets you store raw data in partitioned form, which helps with further querying.

A SELECT query is generally faster when executed against a partitioned table, since Hive can prune partitions the query does not touch. It is usually advised to create an external table over the raw file in HDFS first, and then insert that data into a partitioned table.
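For example, once the partitioned stocks_main table defined below is loaded, a query that filters on the partition column reads only the matching partition directory instead of scanning the whole dataset (the date here is just an illustrative value):

SELECT symbol, price_close
FROM stocks_main
WHERE ymd = '2010-02-08';   -- partition pruning: only the ymd=2010-02-08 directory is scanned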

But when we try to insert a really large file into a dynamically partitioned table, the insert often fails because too many files are opened at the mapper stage: each mapper opens a separate output file for every partition it encounters, so the total file count grows as mappers × partitions.

The following steps solve the problem of inserting massive data into a dynamically partitioned Hive table:

Let's say the table is defined like this:

CREATE TABLE IF NOT EXISTS stocks_main(
`exchange` STRING,    -- `exchange` is a reserved word in recent Hive versions, hence the backticks
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
PARTITIONED BY (ymd STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/impadmin/NYSE_PARTITIONED';

and stocks is an external table created over the raw data in HDFS with the same columns, where ymd is an ordinary column rather than a partition.
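For completeness, a minimal sketch of what that external table could look like (the HDFS path /user/impadmin/NYSE_RAW and the column order are assumptions for illustration):

CREATE EXTERNAL TABLE IF NOT EXISTS stocks(
`exchange` STRING,
symbol STRING,
ymd STRING,    -- ymd is an ordinary column here, not a partition
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/impadmin/NYSE_RAW';   -- assumed path to the raw CSV data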

Now we need to pull the data from the stocks table into the partitioned stocks_main table.

We need to set the following properties to enable the insert:

-- let all partition column values be determined dynamically, with no static partition required
SET hive.exec.dynamic.partition.mode=nonstrict;
-- on older Hive versions you may also need: SET hive.exec.dynamic.partition=true;
-- do not run independent stages of the job in parallel
SET hive.exec.parallel=false;
-- raise the caps on how many dynamic partitions the insert may create
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;

Execute the following query to insert into the table:

-- note: the dynamic partition column (ymd) must come last in the SELECT list
INSERT INTO TABLE stocks_main PARTITION (ymd)
SELECT `exchange`, symbol, price_open, price_high, price_low, price_close, volume, price_adj_close, ymd
FROM stocks;

This runs as a map-only job, and when too many output files end up open at once, the insert operation fails.

This can be sorted out by moving the file-writing work to the reducer stage, by executing the same query with a DISTRIBUTE BY clause:

INSERT INTO TABLE stocks_main PARTITION (ymd)
SELECT `exchange`, symbol, price_open, price_high, price_low, price_close, volume, price_adj_close, ymd
FROM stocks
DISTRIBUTE BY ymd;   -- rows with the same ymd are routed to the same reducer
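To check that the plan now contains a reduce stage, you can prefix the same statement with EXPLAIN (a quick sanity check; the exact output format varies across Hive versions):

EXPLAIN
INSERT INTO TABLE stocks_main PARTITION (ymd)
SELECT `exchange`, symbol, price_open, price_high, price_low, price_close, volume, price_adj_close, ymd
FROM stocks
DISTRIBUTE BY ymd;
-- the plan should now include a Reduce Operator Tree before the file sink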

This converts the map-only job into a full map-reduce job: all rows for a given ymd are shuffled to a single reducer, so each reducer writes only its own partitions and the number of open files stays manageable. This fixes the problem of writing huge data into the Hive table.
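Once the insert completes, a quick way to confirm the result is to list the partitions that were created:

SHOW PARTITIONS stocks_main;
-- prints one ymd=<date> line per distinct date in the source data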

 
