When working with Spark applications we often use the methods withColumn and select. They are two of the most frequently used DataFrame functions. Although they provide similar functionality, they must be used carefully: withColumn performs very poorly in some cases, and the Spark job can get stuck just generating the execution plan. Let's see why this happens.
Let us take the example of a Hive table with a large number of columns. All the columns in the table are of type String, and we want to convert them to Integer. For this requirement we can use the code snippet shown below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val modifiedDF = originalDF.columns.foldLeft(originalDF) {
  (current, c) => current.withColumn(c, col(c).cast(IntegerType))
}
The above snippet meets our requirement, but it performs poorly. The reason is that DataFrames in Spark are immutable: withColumn is called once for each column to be converted to Integer, and every call creates a new DataFrame, adding another node to the logical plan. If you have hundreds or thousands of columns, calling withColumn that many times produces a very deep plan, and the resulting analysis work becomes a performance bottleneck.
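The cost can be illustrated without Spark at all. The sketch below uses a hypothetical, immutable Frame class (not a Spark class) to mimic what foldLeft over withColumn does: one brand-new object, and one more level of lineage, per column.

```scala
// Minimal sketch: Frame is a hypothetical stand-in for a DataFrame.
// Like a DataFrame it is immutable, so each withColumn-style step
// allocates a new object and deepens the lineage by one level.
final case class Frame(columns: Map[String, String], depth: Int) {
  // Analogue of DataFrame.withColumn: returns a new Frame every call.
  def withColumn(name: String, newType: String): Frame =
    Frame(columns.updated(name, newType), depth + 1)
}

val original = Frame(Map("a" -> "string", "b" -> "string", "c" -> "string"), depth = 0)

// The same foldLeft pattern as in the snippet above.
val modified = original.columns.keys.foldLeft(original) {
  (current, c) => current.withColumn(c, "int")
}

// Three columns produce three intermediate Frames; a table with
// thousands of columns would build a lineage thousands of levels deep.
println(s"frames created: ${modified.depth}")
```

With only three columns this looks harmless, but the depth grows linearly with the column count, which is exactly what makes Spark's plan generation blow up on wide tables.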
To avoid this we can use another method, select. To get the same result we can use the code snippet shown below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val modifiedDF = df.select(df.columns.map(c => col(c).cast(IntegerType)) : _*)
The above code gives the same result, but it is much more efficient than the first approach: all the casts are expressed as a single projection, so only one new DataFrame is created no matter how many columns we convert, and plan generation stays cheap.
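The same plain-Scala sketch makes the contrast clear. Using the hypothetical Frame class again (a stand-in for a DataFrame, not a Spark class), a select-style call takes the whole projection at once and builds exactly one new object:

```scala
// Minimal sketch: Frame is a hypothetical stand-in for a DataFrame.
final case class Frame(columns: Map[String, String], depth: Int) {
  // Analogue of DataFrame.select: one new Frame for the whole projection,
  // regardless of how many columns it touches.
  def select(projected: Map[String, String]): Frame =
    Frame(projected, depth + 1)
}

val df = Frame(Map("a" -> "string", "b" -> "string", "c" -> "string"), depth = 0)

// Cast every column in a single projection, like
// df.select(df.columns.map(c => col(c).cast(IntegerType)) : _*).
val result = df.select(df.columns.map { case (name, _) => name -> "int" })

println(s"frames created: ${result.depth}")
```

One projection, one new Frame, whether the table has three columns or three thousand; that is why select scales where chained withColumn calls do not.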