Why do we need partitioning in Hive?
What are the types of partitioning in Hive?
Above are example questions that you might have came across during the interviews. Because partitioning is such wonderful and useful concept in Hive. Here's the blog covering all the sub topics of Hive partitioning.
What is Partitioning in Hive?
Partitioning is a way of dividing the data into small parts based on the value of columns. Using partitions it will be very easy to execute queries on chunks data instead querying whole data.
What is the use of partitioning in Hive?
Partitioning will split the large data into small chunks of data. And the chunks will contain the data that is relevant to a particular key. Usually when you query on Hive tables, then Hive engine converts queries into MapReduce and processes them. MapReduce jobs will read the entire data that is need and does the operations that are mentioned in query. Whole data will be read by MapReduce which includes the unnecessary data. So to avoid this we can divide the data by some criteria and do the operations on them. This division of data is partitioning. When we have Hive tables partitioned, then the queries will take the necessary data only, so the disk I/O will be reduced and the processing time also.
How to create partitions in Hive?
Creating partitions in Hive is very simple. You just need to mention the partitioning columns using PARTITIONED BY clause while creating the table. Below is the sample generic query to create a Hive table with partitions.
CREATE TABLE table_name (col_1 data_type, col_2 data_type ..) PARTITIONED BY (partition_col_1 data_type, partition_col_2 data_type ..);
Types of Partitioning in Hive?
Hive has two types of Partitioning. They are Static partitioning and Dynamic partitioning.
Static partitioning - In static partitioning user needs to add the data to individual partitions. Static partitioning is used when we need to load large data files into Hive. Static partitioning saves lot of time because we will just create the partition and move the data to the particular partition location. To use the static partitioning feature we must the property to strict as: hive.mapred.mode = strict. This property will have false by default in hive-site.xml. Static partitioning can be done on both managed or external tables. We can alter static partitions. To use LIMIT clause with static partitioning we should use WHERE clause.
Dynamic partitioning - In this partitioning we will insert data into the table with one query, there is no need to insert data into individual partitions. Dynamic partitioning can be using an non-partitioned partitioned to make a partitioned out of it. As we don't specify the partitioning column value while inserting the data, dynamic partitioning will take more time to insert the data. Dynamic partitioning can used in cases where we don't know how many partitions can be created for a column. To use LIMIT clause we don't need to use WHERE clause in case of Dynamic partitioning. We can't alter the dynamic partitions. Dynamic partitioning can done on both static and dynamic partitions. To enable dynamic dynamic partitioning the property hive.mapred.mode should be set to non-strict in hive-site.xml.
Advantages of Partitioning
- Partitioning distributes the data horizontally and the execution load also will be distributed accordingly.
- Queries on partitioned columns will faster than non-partitioned columns. Because entire table need not to be scanned for searching the results.
Disadvantages of partitioning
- Partitioning always creates a folder for each partition. If we have lot of different values for partitioning columns then partitioning will create lot of folders with small files.
- Partitioning may lead to Data Skewing, which means that uneven distribution of the data. Assume we have a data two months out of which one is having the sales data during season and other has data of non-season, which has lesser sized data. If we create partitioning on the month, then query execution will give worst performance.