What is the difference between cache() and persist() in Spark?

Similar and related questions:

How do you cache a dataset in Spark?

How many ways are there to cache data in Spark?

What is the default cache level?

 

What is data caching?

Data caching means keeping data in memory (or, with some storage levels, on disk) after it has been used. Some data is needed repeatedly, and we don't want Spark to compute it from scratch every time it is required. If we cache that data, Spark returns the cached copy on later requests instead of calculating it again. This removes the repeated computation and so reduces the overall processing time of the job, which is why caching is one of the standard performance optimization techniques.
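
The idea can be sketched in plain Python. This is only a toy model, not Spark's API: the class LazyDataset, its compute_calls counter, and the function expensive_squares are hypothetical names made up for illustration. It assumes, as in Spark, that marking data for caching is lazy and the data is actually stored the first time it is computed.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the data
        self._cached = None        # holds the result once it has been cached
        self._use_cache = False    # set to True by cache()
        self.compute_calls = 0     # how many times the data was computed

    def cache(self):
        # Only marks the dataset for caching; nothing is computed yet.
        self._use_cache = True
        return self

    def collect(self):
        # Serve the cached copy if one exists.
        if self._use_cache and self._cached is not None:
            return self._cached
        # Otherwise compute the data (again) from scratch.
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive_squares():
    # Stands in for a costly computation.
    return [n * n for n in range(5)]


uncached = LazyDataset(expensive_squares)
uncached.collect()
uncached.collect()

cached = LazyDataset(expensive_squares).cache()
cached.collect()
cached.collect()

print(uncached.compute_calls, cached.compute_calls)  # prints: 2 1
```

Without caching the data is computed on every request; with caching it is computed once and reused.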

 

What are the ways to cache the data in Spark?

There are two ways to cache the data.

1). cache()

2). persist()

 

Why are there two ways to do the same operation?

Cache:

cache() takes no arguments and always uses the default storage level, which for an RDD is MEMORY_ONLY. Use it when you want to cache the data in memory only; it cannot be used to save the data at any other storage level.

 

Persist:

persist() accepts a storage level as an argument, so it can cache an RDD at a level of your choice. The five most commonly used storage levels are:

1). MEMORY_ONLY

2). MEMORY_ONLY_SER

3). MEMORY_AND_DISK

4). MEMORY_AND_DISK_SER

5). DISK_ONLY

 

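Note: When called without arguments on an RDD, persist() uses the MEMORY_ONLY storage level, the same level that cache() uses. For DataFrames and Datasets the default level is MEMORY_AND_DISK.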
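
The relationship between the two calls can be sketched with a toy class in plain Python. ToyRDD and RDD_DEFAULT_LEVEL are hypothetical names used for illustration; only the method names cache() and persist() mirror Spark's RDD API, and the levels are plain strings here rather than Spark's StorageLevel objects.

```python
# Default storage level for an RDD, as stated in the note above.
RDD_DEFAULT_LEVEL = "MEMORY_ONLY"


class ToyRDD:
    """Hypothetical toy class; not Spark's RDD."""

    def __init__(self):
        self.storage_level = "NONE"   # not marked for caching yet

    def persist(self, level=RDD_DEFAULT_LEVEL):
        # persist() lets the caller choose the storage level.
        self.storage_level = level
        return self

    def cache(self):
        # cache() is shorthand for persist() with the default level.
        return self.persist()


a = ToyRDD().cache()
b = ToyRDD().persist()
c = ToyRDD().persist("MEMORY_AND_DISK")

print(a.storage_level, b.storage_level, c.storage_level)
# prints: MEMORY_ONLY MEMORY_ONLY MEMORY_AND_DISK
```

In PySpark the corresponding calls would be rdd.cache() and rdd.persist(StorageLevel.MEMORY_AND_DISK), with StorageLevel imported from the pyspark package.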

 

MEMORY_ONLY - Spark stores the RDD as deserialized Java objects in the JVM. If the RDD doesn't fit in memory, some partitions won't be cached and will be recomputed on the fly each time the Spark job needs them.

MEMORY_ONLY_SER - The RDD is stored as serialized Java objects (one byte array per partition). This is more space-efficient than deserialized objects but more CPU-intensive to read.

MEMORY_AND_DISK - Stores the RDD as deserialized Java objects. Partitions that don't fit in memory are stored on disk and read from there when they are required.

MEMORY_AND_DISK_SER - The RDD is stored as serialized Java objects. Partitions that don't fit in memory are spilled to disk and read from there when required, instead of being recalculated on the fly. This storage level reduces both garbage-collection overhead and expensive recomputation.

DISK_ONLY - Stores the RDD partitions on disk only.

 

MEMORY_ONLY_2, MEMORY_AND_DISK_2 - These two are the same as the corresponding storage levels above, but each partition is replicated on two cluster nodes.
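
The storage levels described above differ in only a few properties, which can be summarized as a small table of flags. The following plain-Python sketch is a simplified model, not Spark code: the Level tuple, the LEVELS table, and the placement function are hypothetical names, and the function only captures the fallback rule described above (memory first, then disk if the level allows it, otherwise recompute).

```python
from collections import namedtuple

# Simplified description of a storage level (illustrative only).
Level = namedtuple("Level", "use_disk use_memory deserialized replication")

LEVELS = {
    "MEMORY_ONLY":         Level(False, True,  True,  1),
    "MEMORY_ONLY_SER":     Level(False, True,  False, 1),
    "MEMORY_AND_DISK":     Level(True,  True,  True,  1),
    "MEMORY_AND_DISK_SER": Level(True,  True,  False, 1),
    "DISK_ONLY":           Level(True,  False, False, 1),
    "MEMORY_ONLY_2":       Level(False, True,  True,  2),
    "MEMORY_AND_DISK_2":   Level(True,  True,  True,  2),
}


def placement(level_name, fits_in_memory):
    """Return where a partition ends up under the given storage level."""
    level = LEVELS[level_name]
    if level.use_memory and fits_in_memory:
        return "memory"
    if level.use_disk:
        return "disk"
    # Not stored anywhere: it is recomputed when needed.
    return "recompute"


print(placement("MEMORY_ONLY", fits_in_memory=False))      # prints: recompute
print(placement("MEMORY_AND_DISK", fits_in_memory=False))  # prints: disk
print(placement("DISK_ONLY", fits_in_memory=True))         # prints: disk
```

Read this way, choosing a storage level is a trade-off between memory use, CPU cost for serialization, and the cost of recomputing a partition that could not be kept.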
