In this post we will see how we can extract unique records from a Hive table. This can be achieved in three ways. One is using the analytical function ROW_NUMBER() and another approach is using DISTINCT keyword.
Let's look at the query to achieve this and then get into explanation. Below is the query to delete duplicate records in a Hive table using ROW_NUMBER() function.
INSERT OVERWRITE TABLE my_table SELECT tab.id, tab.NAME FROM (SELECT id, name, ROW_NUMBER() OVER ( partition by ID ORDER BY id, NAME) AS row_num FROM my_table) as tab WHERE tab.row_num= 1;
In the above query three parts are there. First one is INSERT, second is an outer SELECT query and there is an inner SELECT query too. Let's from lower level that is inner SELECT query. In this query we are selecting id, name along with ROW_NUMBER() over the partition id and name. Means it creates partitions of row where both id and name are equal to id and name any other row. And for row in the partition it just give the row number starting 1 without skipping any row. So inside each partition each duplicate row gets a unique number. For this inner query one alias is given that "tab". In the outer SELECT query we are selecting tab.id, tab.name with a WHERE condition tab.row_num=1. So inner query returns rows with identical values along with ranks but we are filtering only records with row number 1. So we will get only unique records. Finally we are INSERTing and Over writing to same table.
Take a look at the below query.
INSERT OVERWRITE TABLE my_table SELECT DISTINCT * FROM my_table;
In the SELECT query we are extracting all the columns by applying DISTINCT. So it returns all the unique records. And then we are inserting and overwriting the same table.
Please let us know what you think of this post..