Difference between dropDuplicates() and distinct()

To remove duplicate rows from a DataFrame we can use either the distinct() or the dropDuplicates() method. Called with no arguments, dropDuplicates() behaves the same as distinct(). The difference is that dropDuplicates() can also remove duplicates based on the values of a subset of columns.

val distinctDF = df.distinct()

The statement above returns a new DataFrame containing only the unique rows of df; df itself is not modified. Two rows count as duplicates only when they match in every column, and distinct() gives us no way to restrict the comparison to particular columns.

val dedupedDF = df.dropDuplicates("column3", "column5")

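Here we have indicated that only column3 and column5 are to be considered when checking for duplicates. Whenever two rows have equal values in both of these columns, only one of them is kept, even if they differ in other columns. The result still contains all the columns of df, and which of the matching rows survives should not be relied on.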
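To make the difference concrete, here is a minimal sketch that runs both methods on the same small DataFrame. The object name DedupExample, the helper run(), the id column and the sample rows are made up for illustration, and a local SparkSession is assumed to be available on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names and data are illustrative only.
object DedupExample {

  // Returns (row count after distinct, row count after dropDuplicates on two columns).
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .appName("DedupExample")
      .master("local[*]") // assumption: run locally for the demo
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "a", "x"),
      (1, "a", "x"), // exact duplicate of the first row
      (2, "a", "x"), // differs from the first row only in id
      (3, "b", "y")
    ).toDF("id", "column3", "column5")

    // Compares whole rows: only the exact duplicate is removed.
    val distinctDF = df.distinct()

    // Compares only column3 and column5: one row per (column3, column5) pair is kept.
    val dedupedDF = df.dropDuplicates("column3", "column5")

    (distinctDF.count(), dedupedDF.count())
  }

  def main(args: Array[String]): Unit = {
    val (distinctCount, dedupedCount) = run()
    println(s"distinct: $distinctCount rows")       // 3 rows
    println(s"dropDuplicates: $dedupedCount rows")  // 2 rows
  }
}
```

With this data distinct() leaves three rows, because the rows with id 1 and id 2 differ in the id column, while dropDuplicates("column3", "column5") leaves two rows, one for ("a", "x") and one for ("b", "y"). The id value that survives for the ("a", "x") pair may be 1 or 2.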
