Hive – Order By vs Sort By vs Cluster By vs Distribute By

Hive has so many clubbing operations like Order By, Sort By etc. Each clause it's own uses, advantages and disadvantages. These will be most frequently asked interview questions. These may not be asked together but they will be pairs wise. So we thought of covering all of them together in a single for comparability. Let's look at them individually..

 

Order By

Order clause is used to sort the data globally. We can specify DESC to sort the data in decreasing order and ASC to sort in ascending order. Order By clause has some limitations. In strict mode i.e., when we set hive.mapred.mode to strict, then the Hive query must have limit at the end. This is because Order By sorts the data globally, so there should be only one reducer to produce the output. If the input has huge data then one reducer might take lot of time. To avoid that we have to use Limit clause at the end.

Sort By

Sort By clause sorts the data per reducer but not globally. The mail difference between Sort By and Order By is the latter one guarantees global sort of data whereas the former guarantees per reducer sorting of data.

Distribute By

Distribute By clause is used to distribute the values columns among the reducers. All the distribute columns will go to the same reducer. It doesn't sort the data per reducer and not even globally.

Cluster By

This is the resultant command of performing Distribute By and Sort By clauses together. This command is used to distribute the rows among the reducers. Columns mentioned in the Clustered By clause will go to the multiple reducers and ensures that rows are sorted per reducer.

 

Please give us your valuable feedback..

Leave a Reply

Your email address will not be published. Required fields are marked *