Map join in Hive has several different names like Auto Map join, Map side join and Broadcast join. It is special feature in Hive to improve the performance as a result it speeds up the query execution. It completes the join operation in Map phase itself without the need of running a reducer phase. It loads the smaller table into the memory from join operation and finishes the join without running reducer phase. If queries frequently depend on small table joins, using map joins speed up queries' execution. As no reducers are necessary, map joins are way faster than the regular joins.
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through. The prior implementation has this division of labor:
- Local work:
- read records via standard table scan (including filters and projections) from source on local machine
- build hashtable in memory
- write hashtable to local disk
- upload hashtable to dfs
- add hashtable to distributed cache
- Map task
- read hashtable from local disk (distributed cache) into memory
- match records' keys against hashtable
- combine matches and write to output
- No reduce task
Whenever Hive engine needs to perform Map join, first step is it creates a MapReduce local task. After that it reads the data of smaller table. Then it saves the data into an in-memory Hash table and then into a Hash table. After that it moves the hash table to the Hadoop Distributed Cache while original MapReduce task begins, which will populate the file to each mapper's local disk. Therefore, all the mappers can load this hash table file into memory and do the join in Map phase.
How can we convert normal join to Map Side join in Hive?
There are two different ways to achieve this. First is to use hint. We can write a syntax of a hint in the Hive query which specifies the Hive engine to perform Map Side join or Broadcast join.
Typical syntax using hint looks like below:
select /*+ MAPJOIN(time_dim) */ count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk) join date_dim on (ss_sold_date_sk = d_date_sk) where t_hour = 8 and d_year = 2002
Another way is by setting the below properties in Hive:
hive.auto.convert.join: By default, this option is set to
true. When it is enabled, during joins, when a table with a size less than 25 MB (hive.mapjoin.smalltable.filesize) is found, the joins are converted to map-based joins.
hive.auto.convert.join.noconditionaltask: When three or more tables are involved in the join condition. Using
hive.auto.convert.join, Hive generates three or more map-side joins with an assumption that all tables are of smaller size. Using
hive.auto.convert.join.noconditionaltask, you can combine three or more map-side joins into a single map-side join if size of n-1 table is less than 10 MB. (This rule is defined by
Limitations of Map Side join in Hive:
- Full outer joins are never converted to map-side joins.
- A left-outer join are converted to a map join only if the right table that is to the right side of the join conditions, is lesser than 25 MB in size.
- Similarly, a right-outer join is converted to a map join only if the left table size is lesser than 25 MB.
Please comment you thoughts about this post.
Also please let us know what do you want us to cover on our blog to help aspiring Data Engineers..