Explain about Grouping Sets in Hive (or) Grouping Sets in SQL?

What is the use of GROUPING SETS clause in Hive queries?

This is little bit rarely used clause but it has so many advantages. Knowing the functionality of this is very important because you may come across a situation in which you might need to use this. Along with this clause we need to know about few more clauses like Grouping function, Grouping__ID, Cube and Rollup.

 

What is GROUPING SETS clause?

Grouping Sets clause in Group By clauase allows the user to perform more than one Group By operation in the same record set. Any GROUPING SET can be expressed as a multiple Group By operation connected by UNION operation.

What is Grouping function?

The grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row.

The value 0 represents a column that is part of the grouping set, while the value 1 represents a column that is not part of the grouping set. 

Example:

SELECT col1, col2, GROUPING(col1), GROUPING(col2)
FROM table1
GROUP BY col1, col2 WITH ROLLUP;

*You will get to know about ROLLUP used in the above query as you move forward in this blog

What is Grouping__ID?

(Please observe the double underscore in the above function.)

This is a function in SQL that is used to identify in which group the row is used in aggregation. GROUPING__ID will concatenate all the values returned by GROUPING function on the aggregations columns, perform binary to decimal conversion and returns the equivalent integer. You will understand this function when you look at an example that is mentioned below.

Input data:

 

Query for Grouping__ID:

SELECT key, value, GROUPING__ID, count(*)
FROM T1
GROUP BY key, value WITH ROLLUP;

The above query produces the below result:

GROUPING__ID(col1, col2) = GROUPING(col1) + GROUPING(col2)

ROLLUP

ROLLUP is a built in function that was built on GROUPING SETS. A typical query with ROLLUP looks like below:

SELECT key, value
FROM T1
GROUP BY key, value WITH ROLLUP;

This query is equivalent to below using GROUPING SETS:

SELECT key, value
FROM T1
GROUP BY GROUPING SETS
(
    (key, value),
    (key),
    ()
);

 

CUBE

CUBE is also an in-built function, which creates a subtotal of all possible combinations of the set of column in its argument. Once we compute a CUBE on a set of dimension, we can get answer to all possible aggregation questions on those dimensions.

SELECT col1, col2, col3
FROM T1
GROUP BY CUBE(col1, col2, col3);

This query is equivalent to below using GROUPING SETS:

SELECT key, value
FROM T1
GROUP BY GROUPING SETS
(
    (col1, col2, col3),
    (col1, col2),
    (col1, col3),
    (col1),
    (col2, col3),
    (col2),
    (col3),
    ()
);

 

Difference between CUBE and ROLLUP:

CUBE generates a result set that shows aggregations for all the combinations of values in the selected columns, whereas ROLLUP generates a result set that shows aggregations for a hierarchy of values in the selected columns.

Below is the table that shows what are the aggregations will be performed internally by using GROUPING SETS() function when we call CUBE and ROLLUP functions on a set of columns:

 

Please comment your thoughts about this post and also let us know if you want any topic to be covered on our blog.

Leave a Reply

Your email address will not be published. Required fields are marked *