RANK() vs DENSE_RANK() vs ROW_NUMBER() in Hive (or) Differences between RANK(), DENSE_RANK() and ROW_NUMBER() (or) Ranking window functions in Hive

One of the most frequent questions during Data Engineering interviews. These are called Ranking functions in Hive. These are the the functions that return distinct values based on the rank of a given row but each of them give ranks differently in numbers.

 

RANK()

RANK() function can be used to rank the rows based on a column's value. But here we need to know an interesting thing about RANK() function. if there is a tie between N previous records for the value in the ORDER BY column, the RANK functions skips the next N-1 positions before incrementing the counter. For instance, in the above result, there is a tie for the values in the power column between the 1st and 2nd rows, therefore the RANK function skips the next (2-1 = 1) one record and jumps directly to the 3rd row.

 

   select dealer_id, emp_name, sales, rank() over(order by sales) as `rank` from q1_sales;
   +------------+-----------------+--------+-------+
   | dealer_id  |    emp_name     | sales  | rank  |
   +------------+-----------------+--------+-------+
   | 1          | Raphael Hull    | 8227   | 1     |
   | 3          | May Stout       | 9308   | 2     |
   | 2          | Haviva Montoya  | 9308   | 2     |
   | 1          | Jack Salazar    | 9710   | 4     |
   | 3          | Abel Kim        | 12369  | 5     |
   | 3          | Ursa George     | 15427  | 6     |
   | 2          | Beverly Lang    | 16233  | 7     |
   | 2          | Kameko French   | 16233  | 7     |
   | 1          | Ferris Brown    | 19745  | 9     |
   | 1          | Noel Meyer      | 19745  | 9     |
   +------------+-----------------+--------+-------+
   10 rows selected (0.174 seconds)

In the above sample run, we areĀ  selecting the columns dealer_id, emp_name, sales and rank() function over a window. Inside the window we are just taking sales order by ascending(default order ascend if we don't mention any). So in the result we can see a column called rank, which was the alias name given in the query. If we observe the result carefully, there is tie between second and third row and both are given same rank that is 2. When it comes to next row, the rank 2 is give two times, which means one rank is skipped there. So the next row is given rank 4. In this way RANK() function skips n-1 ranks when there tie between n rows.

 

DENSE_RANK()

DENSE_RANK() is also same as RANK() function but it doesn't skip any ranks even if there is tie between the rows.

   select dealer_id, emp_name, sales, dense_rank() over(order by sales) as denserank from q1_sales; 
   +------------+-----------------+--------+------------+
   | dealer_id  |    emp_name     | sales  | denserank  |
   +------------+-----------------+--------+------------+
   | 1          | Raphael Hull    | 8227   | 1          |
   | 3          | May Stout       | 9308   | 2          |
   | 2          | Haviva Montoya  | 9308   | 2          |
   | 1          | Jack Salazar    | 9710   | 3          |
   | 3          | Abel Kim        | 12369  | 4          |
   | 3          | Ursa George     | 15427  | 5          |
   | 2          | Beverly Lang    | 16233  | 6          |
   | 2          | Kameko French   | 16233  | 6          |
   | 1          | Ferris Brown    | 19745  | 7          |
   | 1          | Noel Meyer      | 19745  | 7          |
   +------------+-----------------+--------+------------+
   10 rows selected (0.198 seconds) 

This query is also same as the previous query that was used while explaining RANK() function but there is only one change, instead of RANK() function we replaced it with DENSE_RANK() function and alias also. If we observe the result, even though there is tie between row 2 and row 3, whose rank is same, fourth row got rank 3, whereas the same in case of RANK() is 4. So DENSE_RANK() never skips any rank even if there is tie.

 

ROW_NUMBER()

As the name of this function indicates, it just returns the row number of each record starting from 1. Unlike the other two functions RANK() and DENSE_RANK() this will not consider any tie between the rows even if there are same values present in two different rows. That is why ROW_NUMBER() can be used to delete duplicate records in Hive.

Click Here: to know how to delete duplicate records in hive.

  select dealer_id, emp_name, sales, row_number() over(partition by dealer_id order by sales) as `rownum` from q1_sales;
   +------------+-----------------+--------+---------+
   | dealer_id  |    emp_name     | sales  | rownum  |
   +------------+-----------------+--------+---------+
   | 1          | Raphael Hull    | 8227   | 1       |
   | 1          | Jack Salazar    | 9710   | 2       |
   | 1          | Ferris Brown    | 19745  | 3       |
   | 1          | Noel Meyer      | 19745  | 4       |
   | 2          | Haviva Montoya  | 9308   | 1       |
   | 2          | Beverly Lang    | 16233  | 2       |
   | 2          | Kameko French   | 16233  | 3       |
   | 3          | May Stout       | 9308   | 1       |
   | 3          | Abel Kim        | 12369  | 2       |
   | 3          | Ursa George     | 15427  | 3       |
   +------------+-----------------+--------+---------+
   10 rows selected (0.241 seconds)

Please observe the result that it just returned the row number for each row. Neither repeated values of ROW_NUMBER for any two rows nor skipping of the ranks.

 

Please leave a comment about this post..

Leave a Reply

Your email address will not be published. Required fields are marked *