This post will focus on calculating moving average or sum using Hive queries. We might have come across this question during Data Engineering interviews. This can be done by using Window operations using PARTITION BY clause in queries. And we will see some more new words here like ROWS, BETWEEN, UNBOUNDED, PRECEDING, FOLLOWING. All these functions are not used frequently but if we may come across any situation to use these. Let's see how can use them.
What is moving sum or moving average?
Assume that we are working analysis of some eCommerce retail data and we need to calculate average of sales each day from the last one month or you need to calculate total sales from day 1 till today. This kind of calculations are called moving average or sum.
SELECT date, sales, AVG(sales) OVER wind AS avg_sales FROM sales_table WINDOW wind AS (PARTITION BY sales ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW);
Above is almost self explanatory. Here new terms are WINDOW, ROWS BETWEEN, UNBOUNDED PRECEDING, CURRENT ROW. This query will run on sales_table and will extract the date wise sales along with average sales from last 30 days including today. For this we have created a WINDOW with the name wind and we used PARTITION BY clause to create this window. Inside this window created partitions of 29 previous rows and the current row, which are indicated by ROWS BETWEEN 29 PRECEDING AND CURRENT ROW. In the SELECT clause we extracted date, sales for today and average of sales over the partition. This is how it works.
If we have to calculate average sales from day 1 till today the we can use below query:
SELECT date, sales, AVG(sales) OVER wind AS avg_sales FROM sales_table WINDOW wind AS (PARTITION BY sales ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
To calculate average over total sales till end then we can below query:
SELECT date, sales, AVG(sales) OVER wind AS avg_sales FROM sales_table WINDOW wind AS (PARTITION BY sales ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
If we want to calculate average over last three records and next three records then we can use below query:
SELECT date, sales, AVG(sales) OVER wind AS avg_sales FROM sales_table WINDOW wind AS (PARTITION BY sales ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING);
The same queries can written without using WINDOW function. For example like this:
SELECT AVG(sales) OVER (PARTITION BY sales ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_sales FROM sales_table;
Difference between GROUP BY and PARTITION BY
We might think that this can be done by GROUP BY as well. But there is a difference between GROUP BY and PARTITION BY. PARTITION BY
and ORDER BY
works like GROUP BY
and ORDER BY
after the WHERE
clause, except it doesn’t collapse the rows, but only divides them into non-overlapping partitions to work on. So using GROUP BY we can't get results for each and every existing row but it gives result for groups only. So after the query execution we won't the existing rows in the result.
Please what else you feel can appear on this blog and if we have missed Data Engineering concepts.
Comment your thoughts about this post.