How to calculate moving sum or moving average in Hive?

This post will focus on calculating moving average or sum using Hive queries. We might have come across this question during Data Engineering interviews. This can be done by using Window operations using PARTITION BY clause in queries. And we will see some more new words here like ROWS, BETWEEN, UNBOUNDED, PRECEDING, FOLLOWING. All these functions are not used frequently but if we may come across any situation to use these. Let's see how can use them.

 

What is moving sum or moving average?

Assume that we are working analysis of some eCommerce retail data and we need to calculate average of sales each day from the last one month or you need to calculate total sales from day 1 till today. This kind of calculations are called moving average or sum.

SELECT
date, sales, AVG(sales) OVER wind AS avg_sales
FROM sales_table
WINDOW wind AS (PARTITION BY sales ORDER BY date
                ROWS BETWEEN 29 PRECEDING AND CURRENT ROW);

Above is almost self explanatory. Here new terms are WINDOW, ROWS BETWEEN, UNBOUNDED PRECEDING, CURRENT ROW. This query will run on sales_table and will extract the date wise sales along with average sales from last 30 days including today. For this we have created a WINDOW with the name wind and we used PARTITION BY clause to create this window. Inside this window created partitions of 29 previous rows and the current row, which are indicated by ROWS BETWEEN 29 PRECEDING AND CURRENT ROW. In the SELECT clause we extracted date, sales for today and average of sales over the partition. This is how it works.

If we have to calculate average sales from day 1 till today the we can use below query:

SELECT
date, sales, AVG(sales) OVER wind AS avg_sales
FROM sales_table
WINDOW wind AS (PARTITION BY sales ORDER BY date
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);

To calculate average over total sales till end then we can below query:

SELECT
date, sales, AVG(sales) OVER wind AS avg_sales
FROM sales_table
WINDOW wind AS (PARTITION BY sales ORDER BY date
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

If we want to calculate average over last three records and next three records then we can use below query:

SELECT
date, sales, AVG(sales) OVER wind AS avg_sales
FROM sales_table
WINDOW wind AS (PARTITION BY sales ORDER BY date
                ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING);

 

The same queries can written without using WINDOW function. For example like this:

SELECT 
AVG(sales) OVER (PARTITION BY sales ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_sales
FROM sales_table;

 

Difference between GROUP BY and PARTITION BY

We might think that this can be done by GROUP BY as well. But there is a difference between GROUP BY and PARTITION BY. PARTITION BY and ORDER BY works like GROUP BY and ORDER BY after the WHERE clause, except it doesn’t collapse the rows, but only divides them into non-overlapping partitions to work on. So using GROUP BY we can't get results for each and every existing row but it gives result for groups only. So after the query execution we won't the existing rows in the result.

 

 

Please what else you feel can appear on this blog and if we have missed Data Engineering concepts.

Comment your thoughts about this post.

Leave a Reply

Your email address will not be published. Required fields are marked *