What is a Paired RDD in Spark?

A paired RDD in Spark is an RDD whose elements are key-value pairs. Each pair links a key to a value, and the key identifies the value associated with it. Paired RDDs support operations that plain RDDs do not, such as distributed shuffling, aggregation, and grouping by key. In Scala, these operations become available automatically on any RDD of Tuple2 objects. They are defined in the PairRDDFunctions class, which wraps an RDD of tuples through an implicit conversion.
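To make the wrapping concrete, here is a minimal sketch that calls reduceByKey twice: once through the implicit conversion and once with the PairRDDFunctions wrapper written out by hand. The application name, the local master setting, and the sample data are assumptions for illustration. In spark-shell the `sc` value already exists, so the context setup can be skipped.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.{PairRDDFunctions, RDD}

// Local context for illustration only (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

// An RDD of Tuple2 objects: first element is the key, second is the value.
val pairs: RDD[(String, Int)] =
  sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey is not a method of RDD itself; the implicit conversion
// to PairRDDFunctions supplies it.
val viaImplicit = pairs.reduceByKey(_ + _).collectAsMap()

// The same call with the wrapper constructed explicitly.
val viaWrapper =
  new PairRDDFunctions[String, Int](pairs).reduceByKey(_ + _).collectAsMap()
```

Both values should hold the same per-key sums, which suggests that the pair operations are simply methods of the wrapper class.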

 

How to create a Paired RDD?

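There are many ways to create a paired RDD. A common one is to apply the map function so that it returns a (key, value) tuple for each element.

Assuming rdd is an RDD of text lines, the snippet below creates a paired RDD of (String, Int) tuples, pairing each word with the count 1: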

rdd.flatMap(line => line.split(" ")).map(word => (word,1))
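Since map is only one of several options, the following sketch shows a few other ways to end up with a paired RDD. The sample words, variable names, and local context settings are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-creation").setMaster("local[*]"))

val words = sc.parallelize(Seq("spark", "rdd", "scala"))

// 1. Parallelize a collection that already holds tuples.
val fromTuples = sc.parallelize(Seq(("spark", 1), ("rdd", 2)))

// 2. keyBy derives a key from each element and keeps the element as the value.
val byLength = words.keyBy(_.length)              // RDD[(Int, String)]

// 3. map can return any tuple, here (word, length).
val withLength = words.map(w => (w, w.length))    // RDD[(String, Int)]
```

Whichever route is used, the result is an RDD of tuples, so the key-based operations apply to it.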

The snippet below extends it with reduceByKey to count the occurrences of each word:

rdd.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_).collect
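For a version that can be run end to end, the sketch below applies the same pipeline to two made-up input lines. The input text and the local context settings are assumptions; the result is sorted on the driver only so that the printed order is stable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("word-count-sketch").setMaster("local[*]"))

// Hypothetical input; in practice this could come from sc.textFile.
val rdd = sc.parallelize(Seq("to be or not to be", "to see or not to see"))

val wordCounts = rdd
  .flatMap(line => line.split(" "))   // one element per word
  .map(word => (word, 1))             // paired RDD of (word, 1)
  .reduceByKey(_ + _)                 // sum the 1s for each word
  .collect()                          // bring the result to the driver
  .sortBy(_._1)                       // sort locally for a stable order

wordCounts.foreach(println)
// (be,2)
// (not,2)
// (or,2)
// (see,2)
// (to,4)
```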

Spark Paired RDD operations

Transformation operations -

groupByKey - groups the values that share a key

reduceByKey - combines the values that share a key using a given function

combineByKey - combines the values that share a key, allowing the result type to differ from the value type

keys - returns an RDD of the keys of the paired RDD

values - returns an RDD of the values of the paired RDD

sortByKey - returns the RDD sorted by key
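The transformations listed above can be sketched on a small data set as follows. The sample pairs, the variable names, and the choice of a per-key average for combineByKey are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-transformations").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))

// groupByKey: all values of a key end up in one Iterable.
val grouped = pairs.groupByKey().mapValues(_.toList.sorted).collectAsMap()

// reduceByKey: merge the values of a key pairwise.
val sums = pairs.reduceByKey(_ + _).collectAsMap()

// combineByKey: the accumulator (sum, count) differs from the value type Int.
val averages = pairs.combineByKey[(Int, Int)](
    (v: Int) => (v, 1),                                             // first value of a key
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),          // add a value
    (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2))   // merge partitions
  .mapValues { case (sum, count) => sum.toDouble / count }
  .collectAsMap()

// keys and values drop the other half of each pair.
val allKeys = pairs.keys.collect().toList
val allValues = pairs.values.collect().toList

// sortByKey orders the pairs by key (ascending by default).
val sortedKeys = pairs.sortByKey().keys.collect().toList
```

When the values only need to be merged, reduceByKey is usually a better fit than groupByKey, because it combines values within each partition before the shuffle.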

Action operations -

countByKey - counts the number of elements for each key

collectAsMap - collects the result to the driver as a map for easy lookup of values; if a key occurs more than once, only one of its values is kept

lookup - returns all the values that are associated with the given key
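A short sketch of the three actions, again on assumed sample data with an assumed local context. Unlike transformations, each of these calls runs a job and returns a plain Scala collection to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only (assumed settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-actions").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))

// countByKey: how many elements each key has, as a Map[String, Long].
val countsPerKey = pairs.countByKey()

// collectAsMap: reduce first so that every key is unique, because the
// returned map can hold only one value per key.
val sumsAsMap = pairs.reduceByKey(_ + _).collectAsMap()

// lookup: every value stored under one key.
val valuesForA = pairs.lookup("a")
```

Because the results are held in driver memory, these actions are best suited to data that is small once aggregated.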
