Spark groupByKey vs reduceByKey vs aggregateByKey

Similar questions:

What are the differences among groupByKey, reduceByKey and aggregateByKey?

groupByKey - Use groupByKey with care, because it can cause executors to run into memory issues and fail with out-of-memory exceptions. The values for each key are merged only after the shuffle, so almost all of the input data is sent across the network first, and every value for a given key must then be held together on a single executor. For aggregations this makes groupByKey inefficient.

Below is a word count program in Spark that uses groupByKey:

val rdd = sparkContext.textFile("file_path_location")

rdd.flatMap(line => line.split(" ")).map(word => (word, 1)).groupByKey().map(tup => (tup._1, tup._2.sum)).collect

reduceByKey - It performs better than groupByKey because it uses a combiner. The values for each key are first merged within each partition, and only those partial results are shuffled. This reduces network traffic and the amount of data each executor has to hold after the shuffle. For an aggregation such as word count the two functions produce the same result, but reduceByKey is the more efficient choice.
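The effect of the combiner can be sketched in plain Scala, without a cluster. This is only an illustration, not how Spark implements the shuffle: the object name, the helper functions and the split of the word list into two partitions are all assumptions made for the example (Spark decides the real partitioning itself).

```scala
object CombinerSketch {
  type Partition = Seq[(String, Int)]

  // Assumed split of the word list into two partitions, for illustration only.
  val partitions: Seq[Partition] = Seq(
    Seq("one", "two", "four", "one", "three", "two").map(word => (word, 1)),
    Seq("five", "nine", "ten", "six", "three").map(word => (word, 1))
  )

  // Map-side combine: merge the values for each key inside one partition.
  def combineLocally(partition: Partition): Map[String, Int] =
    partition.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Reduce side: merge the partial sums that arrive from all partitions.
  def mergePartials(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatMap(_.toSeq).groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // groupByKey-style shuffle: every (word, 1) pair crosses the network.
  def recordsWithoutCombine(parts: Seq[Partition]): Int = parts.map(_.size).sum

  // reduceByKey-style shuffle: one partial sum per key per partition.
  def recordsWithCombine(parts: Seq[Partition]): Int = parts.map(p => combineLocally(p).size).sum

  def main(args: Array[String]): Unit = {
    println(s"records shuffled without combine: ${recordsWithoutCombine(partitions)}")
    println(s"records shuffled with combine: ${recordsWithCombine(partitions)}")
    println(mergePartials(partitions.map(combineLocally)))
  }
}
```

With this split, 11 records would be shuffled without the combiner and 9 with it. The gap widens as keys repeat more often within a partition.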

Code

val list = List("one", "two", "four", "one", "three", "two", "five", "nine", "ten", "six", "three")
spark.sparkContext.parallelize(list).map(word => (word,1)).reduceByKey(_+_).collect.foreach(println)

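Output (the order of the pairs may vary):

(one,2)
(two,2)
(four,1)
(three,2)
(five,1)
(nine,1)
(ten,1)
(six,1)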

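aggregateByKey - Like reduceByKey, it merges values within each partition before the shuffle, but it also takes an initial (zero) value and two functions: one that merges a value into the accumulator within a partition, and one that combines the accumulators from different partitions. Because of the initial value, the result type can differ from the type of the input values.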
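As a minimal sketch of aggregateByKey, the example below computes an average per key by accumulating a (sum, count) pair. The sample data, the object name, the averageByKey helper and the local master setting are assumptions made for illustration and do not come from the word count programs above.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object AggregateByKeySketch {
  // Hypothetical sample data: (key, score) pairs.
  val scores: Seq[(String, Int)] = Seq(("a", 10), ("b", 4), ("a", 20), ("b", 6), ("a", 30))

  def averageByKey(sc: SparkContext, data: Seq[(String, Int)]): Map[String, Double] =
    sc.parallelize(data)
      .aggregateByKey((0, 0))(                                      // zero value: (sum, count)
        (acc, value) => (acc._1 + value, acc._2 + 1),               // merge a value within a partition
        (left, right) => (left._1 + right._1, left._2 + right._2)   // combine accumulators across partitions
      )
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("aggregateByKey-sketch")
      .master("local[*]")
      .getOrCreate()

    averageByKey(spark.sparkContext, scores).foreach(println)
    spark.stop()
  }
}
```

For the sample data this gives an average of 20.0 for key a and 5.0 for key b. The input values are Int while the accumulator is an (Int, Int) pair, which reduceByKey cannot express without mapping the values to pairs first.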
