What is the difference between map and flatMap in Spark?

This is one of the most common questions in big data developer interviews. I was asked it in almost every interview I attended.

Here is the answer.

Map

When the map function is applied to an RDD of N elements, the logic defined in the function is applied to every element, and the result is a new RDD of the same length. The input and output have the same number of records.

val rdd = sparkContext.textFile("path_of_the_file")

rdd.map(line => line.toUpperCase).collect.foreach(println) // Converts each line to upper case, collects the results to the driver and prints them.

Here is a sample run of the map function:
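As a minimal sketch of such a run, the snippet below uses a plain Scala Seq in place of the RDD, on the assumption that map treats the elements the same way in both (one output per input). The object name MapSample and the three sample lines are made up for illustration and do not come from a real input file.

```scala
// Plain Scala collections stand in for the RDD here (an assumption for
// illustration). The sample lines are hypothetical.
object MapSample {
  val lines: Seq[String] = Seq("hello world", "spark is fast", "map and flatMap")

  // map applies the function to every element: exactly one output per input.
  val upper: Seq[String] = lines.map(line => line.toUpperCase)

  def main(args: Array[String]): Unit = {
    upper.foreach(println)
    // prints:
    // HELLO WORLD
    // SPARK IS FAST
    // MAP AND FLATMAP

    println(s"input records: ${lines.size}, output records: ${upper.size}")
    // prints: input records: 3, output records: 3
  }
}
```

Three lines go in and three lines come out, which is the "same number of records" behaviour described above.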

 

FlatMap

This function works a little differently from map. flatMap also lets the user define logic that is applied to every element in the RDD, but that logic returns a collection of zero, one or more values per element, and flatMap flattens those collections into a single RDD. So the input and output RDDs might not have the same number of records. Take a look at the code below and it will be clear.

rdd.flatMap(line=>line.split(" ")).collect.foreach(println)

Sample run:
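A minimal sketch of such a run, with a plain Scala Seq standing in for the RDD (an assumption for illustration, since flatMap on a Seq flattens results the same way). The object name FlatMapSample and the sample lines are made up and do not come from a real input file.

```scala
// Plain Scala collections stand in for the RDD here (an assumption for
// illustration). The sample lines are hypothetical.
object FlatMapSample {
  val lines: Seq[String] = Seq("hello world", "spark is fast", "flatMap")

  // Each line is split into an array of words, and flatMap merges all the
  // arrays into one flat sequence: 2 + 3 + 1 = 6 words from 3 lines.
  val words: Seq[String] = lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    words.foreach(println)
    // prints:
    // hello
    // world
    // spark
    // is
    // fast
    // flatMap

    println(s"input records: ${lines.size}, output records: ${words.size}")
    // prints: input records: 3, output records: 6
  }
}
```

Three lines go in and six words come out, so the output is no longer the same size as the input.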

So flatMap looks like a map followed by a flatten: it applies the function to each element and then merges the resulting collections into a single RDD.
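To see that relationship, here is a small sketch on plain Scala collections, where flatten is available as a separate step (the RDD API does not offer a flatten method, so this is a stand-in for illustration only). The object name MapThenFlatten, the helper tokens and the sample lines are all hypothetical. The helper drops empty tokens so that a blank line contributes zero elements, which covers the "0 values" case.

```scala
// Plain Scala collections stand in for the RDD here (an assumption for
// illustration). All names and sample data are hypothetical.
object MapThenFlatten {
  val lines: Seq[String] = Seq("a b", "", "c")

  // Hypothetical helper: split a line into words and drop empty tokens,
  // so a blank line yields an empty sequence.
  def tokens(line: String): Seq[String] =
    line.split(" ").filter(_.nonEmpty).toSeq

  // map keeps one (possibly empty) collection per input line.
  val nested: Seq[Seq[String]] = lines.map(tokens)

  // Flattening the nested result gives the same elements as flatMap.
  val viaFlatten: Seq[String] = nested.flatten
  val viaFlatMap: Seq[String] = lines.flatMap(tokens)

  def main(args: Array[String]): Unit = {
    println(nested.map(_.size).mkString(", ")) // prints: 2, 0, 1
    println(viaFlatMap.mkString(", "))         // prints: a, b, c
    println(viaFlatten == viaFlatMap)          // prints: true
  }
}
```

The blank line produces nothing in the output, while the first line produces two elements, so the record count can shrink as well as grow.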
