What is Skewness in Data?
Data skew means data is distributed unevenly or asymmetrically.
Let's try to understand this in a better way. Assume you are a data engineer working at some organization. You get a task to analyze a huge amount of data about people from different countries. You design a MapReduce job for it, and it takes a long time. When you check the status of the job in the UI, it shows that all the tasks have completed except the ones processing India and China.
A MapReduce job launches several mappers and reducers, and the reducers start once the mappers finish their tasks. Reducers process the data based on keys. If we use the country name as the key, the reducers working on the India and China data will take much longer, because those countries have far larger populations than the others. The data therefore contains many more rows for these two countries than for the rest, so their reducers are still running after all the other reducers have finished. This kind of unevenly distributed data is called skewed data.
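The scenario above is easy to see for yourself with a simple aggregation. This is a hypothetical sketch: the table name `people` and its `country` column are assumptions for illustration, not from any specific dataset.

```sql
-- A quick way to spot the skew: count rows per join key.
SELECT country, COUNT(*) AS rows_per_key
FROM people
GROUP BY country
ORDER BY rows_per_key DESC;
-- If India and China dominate this output, any join or aggregation
-- keyed on `country` will funnel most rows to just one or two reducers.
```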
There is already a post that explains how we can handle skewed data in Big Data applications. (Click here to read the post)
In this post we will concentrate on skew join in Hive alone. Let's get into it.
What is Skew Join in Hive?
When there is skewness in the data, we can use Hive's skew join feature. A skewed table is a table in which certain values appear in much larger numbers than the rest of the data. Hive stores the skewed values in separate files, while the rest of the data is stored in another file.
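If you already know which values are skewed, Hive lets you declare them when creating the table using the `SKEWED BY` clause, which is what drives the separate-file storage described above. A minimal sketch, assuming a hypothetical `people` table skewed on `country`:

```sql
-- Declare the skewed values at table-creation time.
-- STORED AS DIRECTORIES additionally splits the skewed values
-- into their own directories (Hive's list bucketing).
CREATE TABLE people (
  id      BIGINT,
  name    STRING,
  country STRING
)
SKEWED BY (country) ON ('India', 'China')
STORED AS DIRECTORIES;
```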
How to get rid of this?
1). Setting Hive configurations
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
hive.optimize.skewjoin --> Tells the Hive engine to enable the skew join optimization. The algorithm is as follows: at runtime, detect the keys with a large skew. Instead of processing those keys, store them temporarily in an HDFS directory, then process them in a follow-up map-reduce job. The same key need not be skewed for all the tables, so the follow-up map-reduce job (for the skewed keys) is much faster, since it is a map join.
hive.skewjoin.key --> Determines whether a key in a join is skewed. If we see more than the specified number of rows with the same key in the join operator, we treat that key as a skew join key.
hive.skewjoin.mapjoin.map.tasks --> Determines the number of map tasks used in the follow-up map join job for a skew join. It should be used together with hive.skewjoin.mapjoin.min.split for fine-grained control.
hive.skewjoin.mapjoin.min.split --> Determines the maximum number of map tasks used in the follow-up map join job for a skew join by specifying the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks for fine-grained control.
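Putting the settings together, a typical session looks like the sketch below. The table and column names (`people`, `countries`, `country`, `capital`) are assumptions for illustration; only the `set` statements are the actual Hive configuration.

```sql
SET hive.optimize.skewjoin=true;               -- enable runtime skew join
SET hive.skewjoin.key=500000;                  -- rows-per-key threshold for "skewed"
SET hive.skewjoin.mapjoin.map.tasks=10000;     -- cap on follow-up map join tasks
SET hive.skewjoin.mapjoin.min.split=33554432;  -- minimum split size (32 MB)

-- Hypothetical join: a large table skewed on `country`
-- joined with a small lookup table.
SELECT p.name, c.capital
FROM people p
JOIN countries c ON p.country = c.country;
```

With these settings, Hive runs the join normally but diverts any key exceeding the `hive.skewjoin.key` threshold into the follow-up map join job described above.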
How does skew join work in Hive?
Let's assume table A joins table B, and A has skewed data with the value "1" in the joining column.
First, read B and store its rows with key 1 in an in-memory hash table. Then run a set of mappers to read A and do the following:
- For any row with key 1, use the in-memory hashed version of B to compute the join result directly in the mapper.
- Send all the other keys to a reducer, which performs the join. This reducer also gets the rows of B from a mapper.
As a result, we end up reading only B twice. The skewed keys in A are read and processed only by the mapper and are never sent to the reducer. The remaining keys in A go through a single map/reduce pass.
The assumption is that B has only a few rows with the keys that are skewed in A, so those rows can be loaded into memory.
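The two-path plan above can also be expressed by hand, which makes the mechanics concrete. This is a manual sketch of the same idea, not what Hive generates internally: map-join the skewed key, reduce-join the rest, and combine the results. Tables A and B and column names are hypothetical.

```sql
-- Path 1: the skewed key (1) is joined via a map join,
-- with B's matching rows held in memory.
SELECT /*+ MAPJOIN(B) */ A.id, B.val
FROM A JOIN B ON A.key = B.key
WHERE A.key = 1

UNION ALL

-- Path 2: all the other keys go through an ordinary
-- shuffle (reduce-side) join.
SELECT A.id, B.val
FROM A JOIN B ON A.key = B.key
WHERE A.key <> 1;
```

This mirrors the algorithm: B is read twice (once per branch), the skewed key never reaches a reducer, and everything else takes a single map/reduce pass.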
Please share your thoughts about this post in the comments.