Spark RDD vs Dataframe vs Dataset

Similar questions

What is the difference between RDD, Dataframe and Dataset?

Compare RDD, Dataframe and Datasets.

Spark release version

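RDD - RDDs are Spark's original core API. They have been part of Spark since its earliest releases, before version 1.0.

Dataframe - Introduced in Spark 1.3.

Dataset - Introduced in Spark 1.6 as an experimental API. Since Spark 2.0, a Dataframe in Scala and Java is simply a Dataset of Row objects.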

Lazy Evaluation

RDD - RDDs use lazy evaluation. Transformations are only recorded; the data is processed when an action (such as count or collect) is performed.

Dataframe - Dataframes also use lazy evaluation.

Dataset - Datasets also use lazy evaluation.
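The following is a minimal sketch of lazy evaluation across the three APIs, assuming a local Spark session. The object name, app name and sample numbers are made up for illustration. Nothing is computed until the actions on the last line of run are called.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  // Returns the results of the three actions so they can be inspected.
  def run(spark: SparkSession): (Seq[Int], Long, Long) = {
    import spark.implicits._

    val numbers = spark.sparkContext.parallelize(1 to 10)

    // Transformations: Spark only records what should happen; no job runs here.
    val evensRdd = numbers.filter(_ % 2 == 0).map(_ * 10)
    val evensDf  = numbers.toDF("n").filter($"n" % 2 === 0)
    val evensDs  = numbers.toDS().filter(_ % 2 == 0)

    // Actions: collect() and count() trigger the actual computation.
    (evensRdd.collect().toSeq, evensDf.count(), evensDs.count())
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("lazy-evaluation-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```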

Programming Language support

RDD - The RDD API is available in Scala, Java and Python. SparkR does not expose RDDs as a public API.

Dataframe - The Dataframe API is available in Scala, Java, Python and R.

Dataset - The Dataset API is currently available only in Scala and Java. Python and R do not support Datasets.

Compile-Time Safety

RDD - RDDs provide compile-time type safety, because every RDD is typed by the class of its elements.

Dataframe - Dataframes are checked only at run time. If you try to access a column that is not present in the Dataframe, the error is raised at run time, not at compile time.

Dataset - Datasets provide compile-time safety.
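The difference can be sketched as follows, assuming a local Spark session. Person and its fields are hypothetical names used only for this illustration. The typed Dataset call is checked by the compiler, while the misspelled Dataframe column is rejected only when the program runs.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object TypeSafetySketch {
  // Returns the number of adults and whether the bad column failed at run time.
  def run(spark: SparkSession): (Long, Boolean) = {
    import spark.implicits._

    val ds = Seq(Person("Ann", 31), Person("Bob", 17)).toDS()
    val df = ds.toDF()

    // Dataset: field access goes through the case class, so the compiler checks it.
    val adultCount = ds.filter(_.age >= 18).count()
    // ds.filter(_.salary > 0)   // would not compile: Person has no field `salary`

    // Dataframe: column names are plain strings, so this compiles
    // and fails only when Spark analyzes the query at run time.
    val failedAtRunTime =
      try { df.select("salary"); false }
      catch { case _: AnalysisException => true }

    (adultCount, failedAtRunTime)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("type-safety-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```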

Data formats

RDD - Can work with structured and unstructured data, but it does not carry a well-defined schema. The user has to do more work by hand to process structured data with RDDs.

Dataframe - Can work with structured and semi-structured data. It organizes the data into named columns, so the schema is part of the Dataframe itself.

Dataset - Can efficiently process structured and unstructured data. It represents data either as typed JVM objects or as Row objects.

Data sources

RDD - An RDD can be created from different sources, such as a text file or a JDBC connection.

Dataframe - A Dataframe can be created from different file formats or from any RDD.

Dataset - A Dataset can also be created from different file formats as well as from RDDs.
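As a rough sketch of these conversions, the example below builds an RDD from an in-memory collection, turns it into a Dataframe, and then into a Dataset. Employee and the sample rows are hypothetical. File-based sources would go through readers such as sparkContext.textFile or spark.read.json instead; they are left out here so that the sketch does not depend on any file.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this illustration.
case class Employee(name: String, dept: String)

object DataSourcesSketch {
  // Returns the row counts of the RDD, the Dataframe and the combined Dataset.
  def run(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    // RDD built from a local collection.
    val rdd = spark.sparkContext.parallelize(Seq(("Ann", "eng"), ("Bob", "ops")))

    // Dataframe created from the RDD by naming its columns.
    val df = rdd.toDF("name", "dept")

    // Dataset created from the Dataframe by attaching a type.
    val ds = df.as[Employee]

    // Dataset created directly from a collection of objects.
    val fromCollection = Seq(Employee("Cy", "eng")).toDS()

    (rdd.count(), df.count(), ds.union(fromCollection).count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-sources-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```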

Optimization

RDD - RDDs have no built-in optimization engine. They do not benefit from the Catalyst optimizer or the Tungsten execution engine, so they usually perform worse on structured data. The developer has to optimize RDD code by hand.

Dataframe - Dataframe queries are optimized by the Catalyst optimizer, which is built into Spark SQL.

Dataset - Dataset queries are also optimized by the Catalyst optimizer.
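One way to see Catalyst at work is to ask Spark for the query plans. The sketch below assumes a local session and uses made-up column names and data; explain(true) prints the parsed, analyzed and optimized logical plans along with the physical plan.

```scala
import org.apache.spark.sql.SparkSession

object CatalystSketch {
  // Returns the optimized logical plan as text and the number of matching rows.
  def run(spark: SparkSession): (String, Long) = {
    import spark.implicits._

    val df = Seq(("Ann", 31), ("Bob", 17)).toDF("name", "age")

    // A projection followed by a filter; Catalyst is free to rearrange these.
    val query = df
      .select($"name", ($"age" + 1).as("next_age"))
      .filter($"next_age" > 18)

    // Prints the logical and physical plans to standard output.
    query.explain(true)

    (query.queryExecution.optimizedPlan.toString, query.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-sketch")
      .master("local[*]")
      .getOrCreate()
    try run(spark) finally spark.stop()
  }
}
```

An RDD has no equivalent of explain, because an RDD program is executed as written rather than planned by an optimizer.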

Serialization

RDD - An RDD is distributed over the cluster in the form of partitions. Whenever data is shuffled or a partition has to be written to disk, Spark serializes the objects, by default with Java serialization. This is costly in both time and CPU resources.

Dataframe - Spark Dataframes keep data in a compact binary format, which can be stored in off-heap memory, and can perform many transformations directly on that binary data. So there is no need for Java serialization in the case of Dataframes. The Tungsten execution engine generates bytecode for the expressions being evaluated and manages memory explicitly.

Dataset - Datasets use encoders, which translate between JVM objects and the tabular representation. The tabular representation is stored in the Tungsten binary format. Many operations can be performed directly on the serialized data, which improves memory usage.
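The sketch below illustrates both sides under a few assumptions: Reading is a hypothetical record type, and the session runs locally. For RDDs, the spark.serializer setting can replace the default Java serializer with Kryo. For Datasets, an encoder describes how an object maps onto columns, and a filter written as a column expression can run without turning rows back into objects.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Reading(sensor: String, value: Double)

object SerializationSketch {
  // Returns the column names derived by the encoder and a filtered row count.
  def run(spark: SparkSession): (Seq[String], Long) = {
    import spark.implicits._

    // The encoder maps Reading objects to and from Spark's binary row format.
    val encoder: Encoder[Reading] = Encoders.product[Reading]

    val ds = spark.createDataset(Seq(Reading("a", 1.5), Reading("b", 2.0)))(encoder)

    // A column expression is evaluated on the tabular representation.
    val highCount = ds.filter($"value" > 1.8).count()

    (encoder.schema.fieldNames.toSeq, highCount)
  }

  def main(args: Array[String]): Unit = {
    // Kryo is an optional alternative to Java serialization for RDD data.
    val conf = new SparkConf()
      .setAppName("serialization-sketch")
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```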

Garbage collection

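RDD - RDDs have a lot of memory overhead. Every record is a separate JVM object, and creating and destroying these objects puts pressure on the garbage collector.

Dataframe - Dataframes cause less garbage collection than RDDs.

Dataset - Datasets largely avoid garbage collection overhead, because they use Tungsten serialization, which can store data off-heap in binary form.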

Speed of operations

RDD - RDDs give the poorest performance of the three when doing aggregations.

Dataframe - As queries go through the built-in optimizer, Dataframes give good performance and faster results for aggregation operations.

Dataset - Datasets also give faster results, as they too use the Catalyst optimizer.
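To make the comparison concrete, here is a small sketch of the same aggregation written against an RDD and against a Dataframe. The data and column names are invented, and the example shows only how the two versions are expressed, not a benchmark. In the RDD version the developer picks the strategy (reduceByKey); in the Dataframe version the query states what is wanted and Spark plans the execution.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object AggregationSketch {
  // Returns the totals per department computed with each API.
  def run(spark: SparkSession): (Map[String, Int], Map[String, Long]) = {
    import spark.implicits._

    val sales = Seq(("eng", 10), ("ops", 5), ("eng", 7))

    // RDD: the aggregation strategy is chosen by hand.
    val rddTotals = spark.sparkContext
      .parallelize(sales)
      .reduceByKey(_ + _)
      .collect()
      .toMap

    // Dataframe: a declarative query that the optimizer plans.
    val dfTotals = sales.toDF("dept", "amount")
      .groupBy("dept")
      .agg(sum("amount").as("total"))
      .as[(String, Long)] // tuple columns are mapped by position
      .collect()
      .toMap

    (rddTotals, dfTotals)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aggregation-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```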

Schema inference

RDD - RDDs do not have a schema, so we cannot perform operations using column names.

Dataframe - Dataframes can infer the schema from the files, and users can also define a custom schema.

Dataset - Datasets can also infer the schema from the files through the Spark SQL framework, and users can also define the schema.
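Both options can be sketched as below, assuming a local session. The example writes a small temporary JSON file so that it does not depend on any existing path; PersonRecord, the field names and the sample rows are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical record type, used only for this illustration.
case class PersonRecord(name: String, age: Long)

object SchemaSketch {
  // Returns the inferred column names, the explicit column names and an adult count.
  def run(spark: SparkSession): (Set[String], Seq[String], Long) = {
    import spark.implicits._

    // Write two JSON records to a temporary file.
    val path = Files.createTempFile("people", ".json")
    val lines = Seq("""{"name":"Ann","age":31}""", """{"name":"Bob","age":17}""")
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

    // Schema inferred by Spark from the file contents.
    val inferred = spark.read.json(path.toString)

    // Schema defined by the user.
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", LongType)
    ))
    val explicit = spark.read.schema(schema).json(path.toString)

    // A Dataset attaches a type on top of the same schema.
    val adults = explicit.as[PersonRecord].filter(_.age >= 18).count()

    (inferred.columns.toSet, explicit.columns.toSeq, adults)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```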
