What is the difference between RDD, Dataframe and Dataset?
Compare RDDs, Dataframes and Datasets along the following dimensions.
Spark release version
RDD - RDDs are the native API of Spark, introduced in Spark 1.0.
Dataframe - Dataframes were released in Spark 1.3.
Dataset - Datasets were introduced in Spark 1.6.
Lazy evaluation
RDD - RDDs use lazy evaluation: the data is processed only when an action is performed.
Dataframe - Dataframes also use lazy evaluation.
Dataset - Datasets use lazy evaluation as well.
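A minimal Scala sketch of lazy evaluation across the three APIs (the local session and the sample data are illustrative, not part of any real job):

```scala
import org.apache.spark.sql.SparkSession

// A local session purely for illustration
val spark = SparkSession.builder().appName("lazy-demo").master("local[*]").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(1 to 100)
val doubled = rdd.map(_ * 2)        // transformation: nothing runs yet

val df = Seq(1, 2, 3).toDF("n")
val filtered = df.filter($"n" > 1)  // transformation: only a query plan is built

// Actions are the point where Spark actually processes the data
val rddTotal = doubled.sum()        // RDD action
val dfCount  = filtered.count()     // Dataframe action
```

Until `sum()` or `count()` is called, Spark has only recorded the lineage/plan; no partition is computed.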
Programming Language support
RDD - The RDD API is available in multiple languages: Java, Scala, Python and R. This can be considered a flexibility that RDDs offer developers.
Dataframe - The Dataframe API is available in all the languages in which the RDD API is available.
Dataset - The Dataset API is currently available only in Java and Scala; Python and R do not support Datasets.
Type safety
RDD - RDDs provide compile-time type safety.
Dataframe - Dataframes provide only run-time safety: if you try to access a column that is not present in the dataframe, the error is thrown at run time.
Dataset - Datasets provide compile-time type safety.
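The difference can be seen in a short Scala sketch (the `Person` case class and column names are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val df = Seq(Person("Ann", 31), Person("Bob", 25)).toDF()

// Dataframe: a mistyped column name compiles fine but fails at run time
// df.select("age_typo")          // throws AnalysisException when executed

// Dataset: the same mistake is caught by the compiler
val ds = df.as[Person]
// ds.map(p => p.age_typo)        // does not compile: not a member of Person
val adults = ds.filter(p => p.age >= 18)  // lambda is type-checked
```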
Data representation
RDD - RDDs can work with structured and unstructured data, but they have no well-defined schema, so the user has to do extra work to use them.
Dataframe - Dataframes can work with structured and semi-structured data. They organize the data into named columns, so a schema is embedded within them.
Dataset - Datasets can efficiently process structured and semi-structured data. They represent data either as strongly typed JVM objects or as untyped Row objects.
Sources of creation
RDD - RDDs can be created from different sources such as text files, JDBC connections etc.
Dataframe - Dataframes can be created from different file formats or from any RDD.
Dataset - Datasets can also be created from different file formats as well as from RDDs.
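A Scala sketch of the common creation paths (all file paths here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// RDD: from a text file or an in-memory collection
val rddFromFile = spark.sparkContext.textFile("data.txt")       // hypothetical file
val rddFromSeq  = spark.sparkContext.parallelize(Seq("a", "b"))

// Dataframe: from a file format or from an existing RDD
val dfFromJson = spark.read.json("people.json")                 // hypothetical file
val dfFromRdd  = rddFromSeq.toDF("value")

// Dataset: from a file (via a Dataframe) or from a collection/RDD
case class Person(name: String, age: Long)
val dsFromFile = spark.read.json("people.json").as[Person]
val dsFromSeq  = Seq("a", "b").toDS()
```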
Optimization
RDD - No optimization engine is embedded in RDDs; they do not come with the Catalyst optimizer or the Tungsten execution engine, so they give the worst performance when working with structured data. The developer has to optimize them by hand through coding techniques.
Dataframe - Dataframes have the Catalyst optimizer built in.
Dataset - Datasets also come with the Catalyst optimizer.
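You can watch Catalyst at work with `explain(true)`, which prints the parsed, analyzed and optimized logical plans plus the physical plan for a query. A small sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(1000).toDF("id")

// Catalyst rewrites this plan (e.g. pushing the filter down) before execution
df.filter($"id" > 10).select($"id" * 2).explain(true)
```

An equivalent RDD pipeline gets no such plan rewriting; it runs exactly as the developer wrote it.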
Serialization
RDD - A Spark RDD is distributed over the cluster in the form of partitions. Whenever there is shuffling, or a partition needs to be written to disk, Spark uses Java serialization, which is very costly in both time and CPU resources.
Dataframe - Spark dataframes serialize the data into off-heap memory in a binary format and perform all transformation operations directly in memory, so there is no need for Java serialization. Tungsten execution generates byte code in the backend for the evaluated expressions and explicitly manages the memory.
Dataset - Datasets have encoders that handle the conversion between JVM objects and the tabular representation, which is stored in the Tungsten binary format. Datasets allow the user to perform operations directly on the serialized data, which improves memory usage.
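Encoders are ordinary objects you can inspect. A Scala sketch (the `Click` case class is invented for the example):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class Click(url: String, count: Long)

// The encoder maps JVM objects to Tungsten's binary row format and back;
// its schema shows the tabular representation it produces
val clickEncoder = Encoders.product[Click]
println(clickEncoder.schema)

val ds = spark.createDataset(Seq(Click("/home", 3L), Click("/about", 1L)))
// Operations like filter can work against the serialized representation
val popular = ds.filter(_.count > 2)
```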
Garbage collection
RDD - RDDs carry a lot of memory overhead.
Dataframe - Dataframes incur less garbage collection compared to RDDs.
Dataset - Garbage-collection overhead is minimal, because Tungsten serialization stores the data off-heap, outside the reach of the JVM garbage collector.
Aggregation performance
RDD - RDDs give the worst performance when aggregations are performed on them.
Dataframes - As they have an optimizer embedded in them, dataframes give good performance and faster results for aggregation operations.
Datasets - Datasets also provide faster results, as they have the Catalyst optimizer.
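A Scala sketch of a Dataframe aggregation that benefits from Catalyst and Tungsten (the sample data is invented):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("east", 10), ("west", 5), ("east", 7)).toDF("region", "amount")

// Catalyst plans the groupBy/sum; Tungsten executes it on binary rows
val totals = sales.groupBy("region").agg(sum("amount").as("total"))
totals.show()
```

The equivalent RDD version (`map` to pairs, then `reduceByKey`) gets none of this planning and typically runs slower on structured data.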
Schema
RDD - RDDs have no schema, so we cannot perform operations using column names.
Dataframe - Dataframes can infer the schema from files, and users can also define a custom schema.
Dataset - Datasets can also infer the schema from files through the Spark SQL framework, and the user can also define a custom schema.
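Both options look like this in Scala (the CSV file name and columns are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Inferred schema: Spark reads a sample of the file to guess column types
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")                 // hypothetical file

// User-defined schema: explicit types, no inference pass over the data
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val explicitDf = spark.read.option("header", "true").schema(schema).csv("people.csv")
```

Defining the schema explicitly also skips the extra read that inference requires, which matters on large files.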