What are the differences among groupByKey, reduceByKey and aggregateByKey?
groupByKey - We must be careful when using groupByKey because it can cause executors to run out of memory, resulting in OutOfMemoryError exceptions. This is because the values for each key are merged only after the shuffle: almost all of the input data is shuffled across the network first, and merging happens afterwards. As a result, the performance of groupByKey is not efficient.
Below is the code for a word count program in Spark using the groupByKey function:
```scala
val rdd = sparkContext.textFile("file_path_location")
rdd.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map(tup => (tup._1, tup._2.sum))
  .collect
```
reduceByKey - It gives better performance than groupByKey because reduceByKey uses a map-side combiner: before the data is shuffled, the values for each key are merged locally within each partition, and only then does the shuffle happen. This greatly reduces network traffic and memory pressure. Although the two functions produce the same result, reduceByKey performs better.
```scala
val list = List("one", "two", "four", "one", "three", "two", "five", "nine", "ten", "six", "three")
spark.sparkContext.parallelize(list)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect
  .foreach(println)
```
aggregateByKey - It is similar to reduceByKey, but it additionally takes an initial (zero) value for the accumulator and two separate functions: one for merging a value into the accumulator within a partition (seqOp) and one for merging accumulators across partitions (combOp). This also allows the result type to differ from the type of the values in the RDD.
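To illustrate those three arguments, here is a minimal sketch of aggregateByKey's semantics using plain Scala collections (so it runs without a Spark cluster); the data, the `(sum, count)` accumulator, and the two-partition split are illustrative assumptions. In Spark itself the equivalent call would be `rdd.aggregateByKey(zero)(seqOp, combOp)`:

```scala
// Sample key-value pairs, standing in for an RDD[(String, Int)]
val pairs = List(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5))

// Zero value: the initial accumulator for each key, here (sum, count)
val zero = (0, 0)
// seqOp: merges one value into the accumulator within a partition
val seqOp = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
// combOp: merges two accumulators coming from different partitions
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)

// Simulate two partitions, aggregate within each, then combine per key
val (p1, p2) = pairs.splitAt(2)
def aggPartition(part: List[(String, Int)]): Map[String, (Int, Int)] =
  part.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).foldLeft(zero)(seqOp)
  }
val merged: Map[String, (Int, Int)] =
  (aggPartition(p1).toSeq ++ aggPartition(p2).toSeq)
    .groupBy(_._1)
    .map { case (k, accs) => k -> accs.map(_._2).reduce(combOp) }

// merged is Map("a" -> (9, 3), "b" -> (6, 2)): per-key sum and count,
// from which an average could be computed -- a result type (Int, Int)
// that differs from the Int values, which reduceByKey cannot produce.
```

Note that the accumulator type `(Int, Int)` is different from the value type `Int`; this is exactly what the zero value and the seqOp/combOp split make possible.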