RDD - Resilient Distributed Dataset
Why is an RDD immutable?
RDDs are immutable so that a lost partition can always be recomputed deterministically from its lineage graph; immutability also makes it safe to share an RDD across parallel tasks without locking.
What is the need for RDDs in Spark?
RDDs give Spark a fault-tolerant, distributed, in-memory collection abstraction, so iterative and interactive jobs can reuse data across operations instead of rereading it from disk on every step, as in classic MapReduce.
What is RDD?
An RDD is the fundamental data structure of Spark: a collection of objects computed across a cluster of nodes. An RDD comprises smaller data sets called partitions, and these partitions reside on different nodes.
Why is it called Resilient Distributed Dataset?
Resilient - Each RDD records a lineage graph as it is built. If any partition is lost or damaged due to a node failure or other technical issue, Spark can reconstruct the missing partition using this lineage graph.
Distributed - An RDD is split into smaller data sets called partitions. These partitions reside on different nodes, which is why it is called a distributed dataset.
Dataset - An RDD can hold the data you want to work with, loaded from various file formats or through a JDBC connection.
How many ways are there to create an RDD?
There are three ways to create an RDD.
1). Using files in HDFS or any other storage
2). By applying transformations on existing RDD
3). By parallelizing an existing collection using the parallelize() method
Can an RDD be cached or persisted?
Yes. An RDD can be cached using the cache() or persist() methods.
Can we change the number of partitions of an RDD?
Yes. We can change the number of partitions of an RDD using the repartition() or coalesce() methods.
What is Lazy Evaluation?
All transformations on an RDD in Spark are lazy: the result of a transformation is not computed when the transformation is called. Computation is deferred until an action is performed that requires data to be returned to the driver program.
What is Location-Stickiness in Spark RDD?
RDDs can specify placement preferences (preferred locations) for computing each partition. The DAG scheduler uses these preferences to schedule tasks as close to the data as possible.
Operations that can be performed on Spark RDD
Spark RDDs support two types of operations: transformations and actions.
Transformations are functions that produce one or more RDDs as output. Transformations never change the original RDD; they always produce new RDDs. Examples of transformations are map, flatMap, reduceByKey etc.
Transformations are of two types.
a). Narrow transformations - These are performed on each partition independently; every output partition contains data that originated from a single input partition. Eg: map, filter etc
b). Wide transformations - These require data from multiple partitions, so they trigger a shuffle of data across partitions. Eg: reduceByKey, groupByKey etc
Actions in Spark produce the final output. An action triggers the lineage graph to execute, running all the transformations defined in it. Actions never produce RDDs; instead they write output to files or return data to the driver program.
Disadvantages of RDDs in Spark
- RDDs have no defined schema, so working with structured data through RDDs is cumbersome compared to DataFrames
- RDDs get no automatic performance optimization: there is no Catalyst optimizer or Tungsten execution engine as with DataFrames, so each job must be tuned by hand