How can we optimize a Hive job?
Since we deal with data at terabyte and petabyte scale, both processing time and CPU usage are very high. We must use resources efficiently while keeping processing time low. Here are some optimization techniques that lead to better performance.
Partition the data
Partitioning is a way of dividing the data into smaller sets on which operations will be performed. If we partition a table on a column, a folder is created for each distinct value of that column, with the partition column value as its name. Whenever we filter on that column in a query, the other partitions are skipped entirely. This improves performance.
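As an illustration, a partitioned table might be created and queried like this (table and column names here are made up for the example):

```sql
-- Create a table partitioned on dept; each dept value gets its own folder
CREATE TABLE employee_part (
  emp_id INT,
  emp_name STRING
)
PARTITIONED BY (dept STRING)
STORED AS ORC;

-- Needed when inserting with dynamic partition values
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE employee_part PARTITION (dept)
SELECT emp_id, emp_name, dept FROM employee_staging;

-- Filtering on the partition column prunes all other partitions
SELECT * FROM employee_part WHERE dept = 'HR';
```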
Use Tez execution engine
Tez is an execution engine built on YARN. Like MapReduce, Tez uses a DAG to execute complex tasks, but it is more powerful and efficient than the MapReduce engine, so using Tez we can achieve better performance. To enable Tez as the execution engine, we must set the property hive.execution.engine to tez.
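The setting can be applied per session with the actual Hive property:

```sql
-- Switch the execution engine for the current session
SET hive.execution.engine = tez;
```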
Use the right file format
In practice, ORC gives the best performance with Hive compared to text, sequence, or other file formats. ORC files carry metadata such as min, max, and sum statistics for their row groups, which lets Hive skip data that cannot match a query. So ORC works well when it comes to Hive.
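A table can be stored as ORC at creation time; the ZLIB compression shown here is just one common choice (table and column names are illustrative):

```sql
CREATE TABLE sales_orc (
  order_id BIGINT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```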
Use bucketing
After partitioning the data based on the values of some columns, individual partitions may still hold huge amounts of data, and in that scenario partitioning alone won't improve performance much. To solve this problem, Hive offers one more technique called bucketing. Bucketing divides large data into a fixed number of small files using hashing, so when we perform operations on this data we get better performance.
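A sketch of a bucketed table (names and the bucket count are illustrative):

```sql
-- On older Hive versions, also run: SET hive.enforce.bucketing = true;
CREATE TABLE employee_bucketed (
  emp_id INT,
  emp_name STRING,
  dept STRING
)
CLUSTERED BY (emp_id) INTO 32 BUCKETS
STORED AS ORC;
```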
Apply compression on intermediate data
Compression is a boon in a Big Data environment. It reduces the size of the data while the content itself remains intact. MapReduce programs always shuffle data between mappers and reducers, and compressing intermediate data reduces the network traffic of that shuffle. We can enable this feature by setting the properties below to true.
To compress map output: set mapred.compress.map.output = true
To compress job output: set mapred.output.compress = true
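The two property names above are the older MapReduce names; on recent Hadoop versions the renamed equivalents, plus an optional codec, look like this (Snappy is only one possible codec choice):

```sql
-- Compress intermediate map output
SET mapreduce.map.output.compress = true;
SET mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Compress final job output
SET mapreduce.output.fileoutputformat.compress = true;
```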
Use Broadcast join
Also called a map-side join, this is similar to Spark's broadcast join. If one of the data sets is small enough to fit in memory, we can broadcast it to all the nodes, so that the nodes don't need to shuffle or search for that data every time they require it.
Below is a sample query to show how to perform broadcast join in Hive:
select /*+ MAPJOIN(table2) */ table1.* from employee table1, departments table2 where table1.col_1 = table2.col_2;
The hint names the smaller table (here departments) to be loaded into memory, forcing a map join. Apart from this, we can set properties so that Hive automatically converts eligible joins to map joins.
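The automatic conversion is controlled by these Hive properties (the size threshold shown is just an example value, in bytes):

```sql
-- Let Hive convert joins to map joins automatically
SET hive.auto.convert.join = true;
-- Tables smaller than this are considered small enough to broadcast
SET hive.mapjoin.smalltable.filesize = 25000000;
```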
Try not to use global sorting
Use the SORT BY clause instead of ORDER BY. ORDER BY sorts the results globally, meaning the overall output is in sorted order, which forces all data through a single reducer. SORT BY sorts data within each reducer, so the overall output is not fully sorted. In most cases we don't need globally sorted output when dealing with large data sets, so avoid global sorting.
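When per-key ordering is enough, DISTRIBUTE BY can be combined with SORT BY so that all rows for a key land on the same reducer, sorted there (column names are illustrative):

```sql
-- Rows for each dept go to one reducer and are sorted within it
SELECT emp_name, dept, salary
FROM employee
DISTRIBUTE BY dept
SORT BY dept, salary DESC;
```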
Use vectorization
Vectorization is a feature available in Hive which enables the engine to process a batch of rows at a time instead of a single row at a time. This improves the performance of reads, joins, and aggregations. To use this feature, set the property below.
set hive.vectorized.execution.enabled = true;
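Reduce-side vectorization has its own switch; note that vectorized execution is tied to the ORC format in older Hive versions:

```sql
SET hive.vectorized.execution.reduce.enabled = true;
```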