What is Bucketing?
Why do we need Bucketing?
How it is going to improve query performance?
Bucketing is a way of segregating or diving the data into small data sets using hashing.
Why do we need Bucketing when we have partitioning which does the similar job?
Partitioning will create a folder for each value of partitioned column and stores the data in files inside these folder. If have large volume of data for some values and small volume for some other values, partitioning will result in unequal sized partitions. For example we want to store population details of India, which is 1.2 billion and Japan, which has 0.126 billion population then partitioning on country column will result two folders with unequal sized part files. Partitioning fail in this kind of scenarios. So solve this Bucketing was introduced in Hive.
- Bucketing is done based on Hashing and hash function depends on the type of the bucketed column
- Records with same hash will go and stored in the same bucket
- Bucketing can done along with Partitioning
- Bucketing creates almost equal part file in storage
- For each bucket one file will be created
Advantages of Bucketing
- Bucketed tables give better performance compared to non-bucketed tables
- Bucketed tables give faster results on querying compared to non-bucketed tables
- Map side joins will be faster
- Bucketing provides the flexibility to sort the records in each bucket to be sorted by one or more columns
How to create bucketed tables?
To use the bucketing feature we must below property to true as show below:
set hive.enforce.bucketing = true;
Below is a generic code to create a bucketed table:
CREATE TABLE table_name (col_1 data_type, col_2 data_type ..) CLUSTERED BY (col_n) INTO 4 BUCKETS;