When does Spark spill a cached RDD or DataFrame onto disk? (or) What is the threshold for spilling cached data onto disk in Spark?

Caching is one of the best optimization techniques available in Spark. When we cache an RDD or DataFrame, Spark keeps the processed collection readily available whenever we need it again in the application, without recomputing it. This saves a lot of time. But if the cached collection is larger than the available memory, Spark will spill the data to disk, provided we use an appropriate StorageLevel. So how does Spark decide when to spill the intermediate data to disk?
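As a quick sketch of how this looks in code (the session setup and input path here are just assumptions for illustration), choosing a disk-backed StorageLevel such as MEMORY_AND_DISK is what permits Spark to spill cached partitions that do not fit in memory, instead of simply dropping and recomputing them:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("CacheSpillExample")            // hypothetical app name
  .getOrCreate()

// Hypothetical input path; any sufficiently large DataFrame behaves the same way.
val ordersDF = spark.read.parquet("/data/orders")

// MEMORY_AND_DISK: partitions that do not fit in storage memory are
// written to local disk rather than being evicted and recomputed later.
ordersDF.persist(StorageLevel.MEMORY_AND_DISK)

ordersDF.count()   // the first action materializes the cache

Note that cache() on a DataFrame already defaults to MEMORY_AND_DISK, while cache() on an RDD defaults to MEMORY_ONLY, which never spills and simply recomputes evicted partitions.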

We set many Spark properties in spark-defaults.conf, pass them with spark-submit, or sometimes set them directly in code on the SparkConf object.

One of those properties is:

spark.memory.fraction

Fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached data evictions occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value, which is 0.6, is recommended.
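If we ever need to override it (the value below simply restates the default for illustration), it can be set on the SparkConf object before creating the session, or equivalently passed as --conf spark.memory.fraction=0.6 to spark-submit:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative only: this restates the default of 0.6.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")

val spark = SparkSession.builder()
  .config(conf)
  .appName("MemoryFractionExample")        // hypothetical app name
  .getOrCreate()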

And also we have another property:

spark.memory.storageFraction

Amount of storage memory immune to eviction, expressed as a fraction of the size of the region set aside by spark.memory.fraction. The higher this is, the less working memory may be available to execution, and tasks may spill to disk more often. Leaving this at the default value, which is 0.5, is recommended.
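To see how these two fractions combine into an actual threshold, here is a rough back-of-the-envelope sketch; the 4 GB executor heap is only an assumed example:

// Assumed example: a 4 GB (4096 MB) executor heap with default settings.
val heapMb          = 4096.0
val reservedMb      = 300.0    // fixed reserved memory
val memoryFraction  = 0.6      // spark.memory.fraction (default)
val storageFraction = 0.5      // spark.memory.storageFraction (default)

// Unified region shared by execution and storage: (4096 - 300) * 0.6 ≈ 2278 MB
val unifiedMb = (heapMb - reservedMb) * memoryFraction

// Storage memory immune to eviction by execution: 2278 * 0.5 ≈ 1139 MB
val protectedStorageMb = unifiedMb * storageFraction

println(f"Unified memory: $unifiedMb%.0f MB, eviction-immune storage: $protectedStorageMb%.0f MB")

Roughly speaking, once cached blocks grow beyond what the unified region can hold, Spark starts evicting them, and with a disk-backed StorageLevel the evicted blocks are spilled to disk instead of being discarded.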
