What is meant by a shared variable? What shared variables are available in Spark?

What is a shared variable?

A shared variable is a variable that is available on all of the executors (nodes) that work on the same operation.

 

Why do we need shared variables?

In general, when a function passed to a Spark operation is executed on a remote cluster with multiple nodes, it runs as tasks on separate executors. Each task works on its own copy of every variable used in the function, so an executor knows nothing about the copies held by the other executors. Updates made to these copies are not sent back to the driver program either, so when we read the variable afterwards we do not get a value that reflects the work done on all of the executors.

 

Keeping this in mind, let us walk through a scenario. Suppose we want to sum up some integer data, so we define a variable to hold the running total. As discussed above, if we read that variable at the end of the job we may see only a partial sum, or no change at all, but not the cumulative total. In this case we need a variable whose updates from all of the executors are combined into one result.
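The scenario can be imitated without a cluster. The sketch below is plain Python and not Spark code: it assumes that "shipping" a variable to a task can be modelled by pickling it, and the names `run_task`, `driver_state` and `partitions` are made up for the illustration.

```python
import pickle

def run_task(serialized_state, partition):
    # Each task deserializes its own private copy of the driver's variables.
    state = pickle.loads(serialized_state)
    for x in partition:
        state["total"] += x  # updates only this task's copy
    return state["total"]

driver_state = {"total": 0}           # the "sum" variable defined on the driver
partitions = [[1, 2, 3], [4, 5], [6]]  # hypothetical data, one list per task

shipped = pickle.dumps(driver_state)   # what travels along with every task
per_task_totals = [run_task(shipped, p) for p in partitions]

print(per_task_totals)        # [6, 9, 6] -> each task saw only its own partition
print(driver_state["total"])  # 0 -> the driver's copy was never updated
```

Every task ends up with a partial sum, and the variable we would read at the end still holds its initial value, which is the problem shared variables are meant to solve.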

 

What are they?

Spark supports two types of shared variables. They are:

1). Accumulators

2). Broadcast variables

 

Accumulators

Accumulators are variables that are only added to, through an associative and commutative operation, so they can be supported efficiently in parallel processing. They are typically used to implement counters and sums. Tasks running on the executors can only add to an accumulator; only the driver program can read its value. Spark natively supports accumulators of numeric types, and programmers can add support for new types.
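To see why the operation has to be associative and commutative, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation: `SumAccumulator`, `run_task` and `merge_all` are hypothetical names invented for this example. In PySpark itself the corresponding calls are `sc.accumulator(0)` on the driver, `acc.add(x)` inside tasks and `acc.value` back on the driver.

```python
class SumAccumulator:
    """Hypothetical stand-in for an accumulator; not Spark's class."""

    def __init__(self, zero=0):
        self.value = zero

    def add(self, x):
        # Task side: the only operation a task is allowed to perform.
        self.value += x

    def merge(self, other):
        # Driver side: fold one task's partial result into the total.
        self.value += other.value

def run_task(partition):
    local = SumAccumulator()  # every task starts from the zero value
    for x in partition:
        local.add(x)
    return local

def merge_all(partials):
    driver = SumAccumulator()
    for p in partials:
        driver.merge(p)
    return driver.value

partitions = [[1, 2, 3], [4, 5], [6]]  # hypothetical data, one list per task
partials = [run_task(p) for p in partitions]

total = merge_all(partials)
total_reversed = merge_all(reversed(partials))
print(total, total_reversed)  # 21 21
```

Because addition is associative and commutative, the order in which the tasks finish does not matter: merging the partial results forwards or backwards gives the same total.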

 


Broadcast variables

A broadcast variable lets the user keep a read-only cached copy of a dataset on every machine, rather than shipping a copy with every task that needs it. This is one of the performance optimization techniques. Spark executes a job by dividing it into stages, and the intermediate data generated between two stages is shuffled. Within each stage, Spark automatically broadcasts the common data required by the tasks to reduce communication cost. Users can also broadcast data explicitly. If a large dataset should be available to all the machines in the cluster, we can broadcast it once and make it available to all the nodes, so that the nodes need not receive the data every time it is required.
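The saving can be illustrated with a small plain-Python sketch. It is a simulation under stated assumptions, not Spark code: shipping data is modelled by a pickle round trip, and the executor names, the `lookup` table and both functions are hypothetical. In PySpark the real calls are `sc.broadcast(lookup)` on the driver and `.value` on the returned object inside tasks.

```python
import pickle

lookup = {"IN": "India", "US": "United States", "FR": "France"}

# (executor, country codes handled by one task); two tasks per executor
tasks = [("exec-1", ["IN", "US"]), ("exec-1", ["FR"]),
         ("exec-2", ["US"]), ("exec-2", ["IN", "FR"])]

def without_broadcast(tasks, table):
    shipped, out = 0, []
    for _executor, codes in tasks:
        local = pickle.loads(pickle.dumps(table))  # one copy per task
        shipped += 1
        out.extend(local[c] for c in codes)
    return out, shipped

def with_broadcast(tasks, table):
    cache, shipped, out = {}, 0, []
    for executor, codes in tasks:
        if executor not in cache:
            # shipped once per executor, then served from the cached copy
            cache[executor] = pickle.loads(pickle.dumps(table))
            shipped += 1
        out.extend(cache[executor][c] for c in codes)
    return out, shipped

names_a, copies_a = without_broadcast(tasks, lookup)
names_b, copies_b = with_broadcast(tasks, lookup)
print(copies_a, copies_b)  # 4 2
```

Both versions produce the same results, but the table travels four times in the first case and only twice in the second. The gap grows with the number of tasks per executor and with the size of the dataset.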

