What is the difference between cache and broadcast variable?

Spark has a lot of performance enhancement techniques. Two of them are cache() and broadcast variables. Although both are used to make some data readily available for processing, there are some differences between the two.

Broadcast variable

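This is used to supply a read-only copy of a small data set to every executor. The data is sent to each executor once and kept there, so whenever tasks on that executor need the data set they read the local copy instead of having it shipped over the network again with every task.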

 

Cache

Spark uses lazy evaluation to improve performance. It does this by maintaining a lineage, which is the sequence of steps used to build an RDD. When we perform any action on an RDD, all the steps in its lineage are processed. By default Spark re-processes these steps every time we call an action, which is inefficient. To avoid this, we can call cache() on the RDD. The processed data is then stored the first time an action runs, and there is no need to recalculate all the steps when we call actions again.

 

So the difference between the two is that a broadcast variable is used to reduce communication cost over the network, whereas cache is used to reduce computation cost.
