A Spark paired RDD (also called a pair RDD) is an RDD whose elements are key-value pairs. Key-value pairs are linked data items in which each key identifies the value that corresponds to it. Paired RDDs support some special operations, such as distributed shuffling, aggregation, and grouping by key. In Scala, these operations become available automatically on any RDD that contains Tuple2 objects. They are defined in the PairRDDFunctions class, which wraps an RDD of tuples.
How to create Paired RDD?
There are many ways to create a paired RDD. One of them is to use the map function to return key-value pairs.
The code snippet below creates a paired RDD of (String, Int) pairs, where each word is paired with the count 1:
rdd.flatMap(line => line.split(" ")).map(word => (word,1))
The code snippet below counts the occurrences of each word in the paired RDD:
rdd.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_).collect
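The two snippets above assume that an RDD of text lines named rdd already exists. A minimal self-contained sketch of the same word count could look as follows; the object name PairedRddWordCount, the sample sentences, and the local master setting are assumptions made for illustration, not part of the original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairedRddWordCount {
  // Builds (word, 1) pairs from lines of text, then sums the counts per word.
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for a demonstration run.
    val conf = new SparkConf().setAppName("PairedRddWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input standing in for the `rdd` used in the snippets above.
      val rdd = sc.parallelize(Seq("spark makes pairs", "pairs of spark"))
      wordCounts(rdd).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the pair-building logic in its own function makes it easy to reuse with any RDD of lines, for example one loaded with sc.textFile.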
Spark Paired RDD operations
Transformation operations -
groupByKey - Groups the values based on a key
reduceByKey - Combines values with same key
combineByKey - Combines the values that share a key, producing a result whose type can differ from the value type.
keys - Returns the keys of the paired RDD
values - Returns the values of the paired RDD
sortByKey - Returns the RDD sorted by key
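The transformations listed above can be sketched on a small data set. The (subject, score) sample data, the object name PairedRddTransformations, and the helper function names are hypothetical and chosen only for illustration. The combineByKey example computes a per-key average, which shows a result type, (sum, count), that differs from the Int value type.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairedRddTransformations {
  // Hypothetical sample data: (subject, score) pairs.
  val sample: Seq[(String, Int)] =
    Seq(("math", 80), ("art", 70), ("math", 90), ("art", 50))

  // groupByKey: gathers all values of a key into one Iterable.
  def grouped(pairs: RDD[(String, Int)]): RDD[(String, Iterable[Int])] =
    pairs.groupByKey()

  // reduceByKey: merges the values of a key with a (V, V) => V function.
  def totals(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _)

  // combineByKey: accumulates (sum, count) per key, then derives the average.
  def averages(pairs: RDD[(String, Int)]): RDD[(String, Double)] =
    pairs
      .combineByKey[(Int, Int)](
        (v: Int) => (v, 1),                                          // createCombiner
        (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
        (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
      )
      .mapValues { case (sum, count) => sum.toDouble / count }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for a demonstration run.
    val conf = new SparkConf().setAppName("PairedRddTransformations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(sample)
      grouped(pairs).mapValues(_.toList).collect().foreach(println)
      totals(pairs).collect().foreach(println)
      averages(pairs).collect().foreach(println)
      println(pairs.keys.collect().toList)         // keys only
      println(pairs.values.collect().toList)       // values only
      pairs.sortByKey().collect().foreach(println) // ascending key order
    } finally {
      sc.stop()
    }
  }
}
```

When the goal is a per-key aggregate such as a sum, reduceByKey is usually preferable to groupByKey, because it combines values within each partition before the shuffle.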
Action operations -
countByKey - Counts the number of elements for each key
collectAsMap - Collects the result as a map, which makes it easy to look up values by key
lookup - Returns all the values that are associated with the given key
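The actions listed above can be sketched in the same way. The sample data, the object name PairedRddActions, and the helper function names are hypothetical. Unlike transformations, each of these calls returns its result to the driver program, so they are best suited to results that fit in driver memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairedRddActions {
  // Hypothetical sample data: (subject, score) pairs.
  val sample: Seq[(String, Int)] = Seq(("math", 80), ("art", 70), ("math", 90))

  // countByKey: number of elements per key, returned to the driver as a map.
  def countsPerKey(pairs: RDD[(String, Int)]): scala.collection.Map[String, Long] =
    pairs.countByKey()

  // collectAsMap: keys are reduced first, because a map keeps one value per key.
  def totalsAsMap(pairs: RDD[(String, Int)]): scala.collection.Map[String, Int] =
    pairs.reduceByKey(_ + _).collectAsMap()

  // lookup: every value stored under the given key.
  def scoresFor(pairs: RDD[(String, Int)], subject: String): List[Int] =
    pairs.lookup(subject).toList

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for a demonstration run.
    val conf = new SparkConf().setAppName("PairedRddActions").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(sample)
      println(countsPerKey(pairs))
      println(totalsAsMap(pairs))
      println(scoresFor(pairs, "math"))
    } finally {
      sc.stop()
    }
  }
}
```

Calling collectAsMap directly on an RDD with repeated keys keeps only one value per key, which is why this sketch reduces by key first.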