zip, zipWithIndex and zipWithUniqueId in Spark

These functions are somewhat rarely used in Spark because they work only on RDDs, and RDDs themselves are used less often these days since they lack the performance optimizations and structured schema of DataFrames and Datasets. Still, they are very useful when we do have to process our data with RDDs. Let's look into them one by one.

 

zip

The zip() method combines two RDDs by pairing corresponding elements from each, producing an RDD of tuples. Because elements are paired by position, both RDDs must have the same number of partitions and the same number of elements in each partition; otherwise Spark will throw an error.

Here I'll construct two RDDs with the same number of elements and pair them using the zip() function.

// Two RDDs with the same number of elements (and the same partitioning)
val rdd  = sc.parallelize(Seq(1, 2, 3, 4, 5, 6, 7))
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5, 6, 7))

// Pair the corresponding elements of rdd and rdd1
rdd.zip(rdd1).collect.foreach(println)

O/p:
(1,1)
(2,2)
(3,3)
(4,4)
(5,5)
(6,6)
(7,7)
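
As a quick aside, zip isn't limited to RDDs of the same element type. Below is a small sketch of my own (the variable names ids and names are just illustrative) pairing an RDD of Ints with an RDD of Strings:

// Both RDDs are parallelized the same way, so their partition counts and sizes match
val ids   = sc.parallelize(Seq(1, 2, 3))
val names = sc.parallelize(Seq("a", "b", "c"))

ids.zip(names).collect.foreach(println)

O/p:
(1,a)
(2,b)
(3,c)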

 

zipWithIndex

zipWithIndex is an operation performed on a single RDD. When we call it, Spark assigns each element an index of type Long, starting at 0 and increasing in the order of the elements across the RDD's partitions.

Below is the code that shows how to use the zipWithIndex() method and its output:

rdd.zipWithIndex.collect.foreach(x => println(x._1 + " " + x._2))

O/p:
1 0
2 1
3 2
4 3
5 4
6 5
7 6
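
A common use of zipWithIndex is to drop or select records by position, for example skipping the first line of a text file. The sketch below assumes a hypothetical file path and variable names of my own:

// Hypothetical path; sc.textFile gives an RDD[String] of lines
val lines = sc.textFile("/tmp/data.csv")

val withoutHeader = lines
  .zipWithIndex                          // (line, index)
  .filter { case (_, idx) => idx > 0 }   // drop the record at index 0
  .map { case (line, _) => line }        // strip the index off again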

 

zipWithUniqueId

This method also assigns a unique Long id to each element of an RDD, but unlike zipWithIndex the ids are not consecutive and don't necessarily follow the element order: each id is derived from the element's partition number and its position within that partition (the element at position i of partition k gets the id i * numPartitions + k). It also doesn't need to trigger a Spark job first, which zipWithIndex does when the RDD has more than one partition.

Below is the code that shows how to use the zipWithUniqueId() method and its output:

rdd.zipWithUniqueId.collect.foreach(x => println(x._1 + " " + x._2))

O/p:
1 0
2 2
3 4
4 1
5 3
6 5
7 7
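
Those ids make sense once you look at how the elements are laid out across partitions. The sketch below (assuming the rdd above ended up with 2 partitions, which is what this output suggests) uses glom to inspect each partition:

println(rdd.getNumPartitions)   // 2 in this run (assumption based on the output above)

// glom turns each partition into an array, so we can see which elements landed where
rdd.glom.collect.foreach(p => println(p.mkString(",")))

O/p:
1,2,3
4,5,6,7

Partition 0 hands out the ids 0, 2, 4 and partition 1 hands out 1, 3, 5, 7, which matches the zipWithUniqueId output above.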
