To delete duplicate records in a DataFrame, we can use either the distinct or the dropDuplicates method. But the dropDuplicates method comes with…
While working with Big Data applications, we might have used the methods withColumn and select(). Both are a few of…
Spark has a lot of performance enhancement techniques. Two of them are cache() and broadcast variables. Although they are both used to…
In this post, we will see how to extract unique records from a Hive table. This can be achieved…
This is one of the most frequent questions in Data Engineering interviews. These are called ranking functions in Hive. These are the…
This post will focus on calculating a moving average or sum using Hive queries. We might have come across this question…
Hive CLI and Beeline can both be used to interact with the Hive execution engine. But there are a few differences between…
Map join in Hive goes by several names, such as Auto Map join, Map-side join, and Broadcast join. It is…
Adaptive Query Execution (AQE): Spark is one of the most widely used frameworks in Data Engineering for processing large volumes of data. As…
What is the use of the GROUPING SETS clause in Hive queries? This is a rarely used clause, but it…