What are the techniques to optimize a Spark job?
This is a very important question for a Big Data developer interview, and it comes up in almost every one. You can't escape it. There are a lot of solutions available for this question, so this post collects the commonly used techniques to optimize a Spark job in one place.
Choosing the right data abstraction
Spark has three data abstractions: RDDs, DataFrames and Datasets. We have to choose the appropriate one based on our requirements. Let's look at the features of all three. The details given below should help you decide which abstraction to choose for better performance.
RDD - The core abstraction of Spark; the other abstractions are built on top of it.
- RDDs usually give the worst performance of the three, as there is no optimizer built in.
- No query optimization is available.
- Garbage collection overhead is high.
DataFrames - A DataFrame is built on top of RDDs, with some additional features.
- DataFrames give the best performance in most cases.
- Provide query optimization through the Catalyst optimizer.
- Garbage collection overhead is lower than with RDDs.
- Benefit from whole-stage code generation.
- The disadvantage of DataFrames is that they are not compile-time type safe.
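Datasets - A Dataset is also built on top of RDDs, with some added features.
- Provide query optimization through the Catalyst optimizer.
- Compile-time type safe.
- There can be a performance impact when dealing with aggregations.
- Serialization/deserialization overhead when typed operations (such as map or filter with a lambda) convert rows to JVM objects.
- Garbage collection overhead is higher than with DataFrames, for the same reason.
- Whole-stage code generation is less effective, because Catalyst cannot look inside typed lambdas.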
Using cache or persist
Use cache or persist whenever an intermediate result is reused. Caching intermediate results improves the performance of a Spark job, because the cached data is used as is when a task needs it, so there is no need to compute it again.
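To see why reuse matters, here is a small toy model in plain Python. It is not Spark API: LazyDataset, expensive_source and compute_calls are made-up names, and the class only imitates the way a lazily evaluated dataset re-runs its whole lineage on every action unless it was marked for caching.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated Spark dataset (not Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the rows from scratch
        self._use_cache = False   # set by cache(); nothing is computed at that point
        self._cached = None       # materialized rows, filled by the first action
        self.compute_calls = 0    # how many times the lineage was re-executed

    def cache(self):
        # Only marks the dataset; the data is stored when the first action runs.
        self._use_cache = True
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached   # reuse the stored result as is
        self.compute_calls += 1
        rows = list(self._compute())
        if self._use_cache:
            self._cached = rows
        return rows

    # Two "actions" that both need the full data.
    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def expensive_source():
    # Hypothetical expensive chain of transformations.
    return (x * x for x in range(1000))


uncached = LazyDataset(expensive_source)
uncached.count()
uncached.total()

cached = LazyDataset(expensive_source).cache()
cached.count()
cached.total()

# The uncached dataset is recomputed for every action, the cached one only once.
print(uncached.compute_calls, cached.compute_calls)
```

In a real job the corresponding calls are cache() or persist() on the DataFrame or RDD, and unpersist() once the data is no longer needed, so the memory can be released.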
Using Kryo serialization
By default Spark uses native Java serialization, which relies on ObjectOutputStream to serialize objects. Java serialization is often slow and produces large serialized output, which hurts performance. Instead of Java serialization we can use Kryo serialization, which can be up to 10x faster and more compact. You can switch to Kryo by setting spark.serializer to org.apache.spark.serializer.KryoSerializer in the Spark configuration.
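A minimal sketch of the Kryo switch, written in Python. The property names spark.serializer and spark.kryoserializer.buffer.max are real Spark settings; the 128m value, the helper names to_submit_args and build_session, and the choice to keep the settings in a dict are assumptions made for illustration. Note that in PySpark this affects JVM-side data only, since Python objects are pickled separately.

```python
# Settings that switch the JVM-side serializer to Kryo.
KRYO_CONF = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Assumed value: raise the buffer ceiling only if large objects fail to serialize.
    "spark.kryoserializer.buffer.max": "128m",
}


def to_submit_args(conf):
    """Render a settings dict as spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(app_name, conf):
    """Create a SparkSession with the given settings (requires pyspark)."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(" ".join(to_submit_args(KRYO_CONF)))
```

The same settings can be passed either on the spark-submit command line or when the session is built; serializer settings have to be in place before the application starts.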
Using bucketing
Bucketing divides the data into a fixed number of subsets, called buckets. The basic difference between partitioning and bucketing is that partitioning creates one partition per distinct value of the partition columns, while bucketing hashes the values of the bucketing columns to decide which bucket each row goes to. Spark uses the bucketing metadata during query optimization, so bucketing can give good performance in joins and aggregations.
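The idea can be sketched in plain Python. This is a toy model, not Spark code: the CRC32 hash is a stand-in (Spark uses its own hash function), and the bucket count, table contents and function names are made up. It shows that when two tables are bucketed the same way on the join key, matching rows always sit in the same bucket number, so each pair of buckets can be joined on its own.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count for the illustration


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash; deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows):
    """Group (key, value) rows into buckets by hashing the key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0])].append(row)
    return buckets


def bucketed_join(left, right):
    """Inner join of two bucketed tables, one bucket pair at a time."""
    joined = []
    for bucket_id, left_rows in left.items():
        # Only the right-hand bucket with the same number can contain matches.
        lookup = defaultdict(list)
        for key, value in right.get(bucket_id, []):
            lookup[key].append(value)
        for key, value in left_rows:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return sorted(joined)


orders = [(1, "book"), (2, "pen"), (3, "lamp"), (1, "ink")]
customers = [(1, "Ana"), (2, "Raj"), (4, "Li")]

result = bucketed_join(bucketize(orders), bucketize(customers))
print(result)
```

In Spark itself, a bucketed table is written with the DataFrameWriter, for example df.write.bucketBy(4, "customer_id").sortBy("customer_id").saveAsTable("orders_bucketed"), where the column and table names here are hypothetical.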
Avoid shuffling as much as possible
While writing Spark jobs we should pay attention to shuffling. Shuffling is the process of redistributing data across partitions. If the partitions reside on different nodes, the data is serialized and sent over the network, which consumes a lot of CPU resources and network bandwidth. We can reduce shuffling by carefully choosing the methods used in the code. For example, we can use reduceByKey instead of groupByKey, because reduceByKey uses a combiner that merges values within each partition first, which reduces the data shuffled between partitions.
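The difference between the two methods can be illustrated with a small simulation in plain Python. This is a sketch of the mechanism, not Spark code: partitions are modelled as lists of (key, value) pairs, and the function names and sample data are made up. Both functions compute the same per-key sums, but they count how many records would have to cross the shuffle.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """groupByKey style: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """reduceByKey style: combine inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value           # map-side combine
        shuffled.extend(local.items())    # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]

grouped, grouped_records = shuffle_group_by_key(partitions)
reduced, reduced_records = shuffle_reduce_by_key(partitions)

# Same result, fewer shuffled records with the combiner.
print(grouped == reduced, grouped_records, reduced_records)
```

With only two keys the saving is small, but it grows with the number of repeated keys per partition, which is why the combiner matters on large, skewed data.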
Using broadcast joins
Broadcast joins improve the performance of Spark jobs when one side of the join is small enough to be copied to every executor. As this should be covered in detail, here is another post with full details of Broadcast Joins.
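As a rough picture of what a broadcast join does, here is a plain Python sketch. It is a toy model with made-up data and function names, not Spark code: the small table is turned into a lookup once, a copy of it is used by every partition of the large table, and the large rows never move.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Inner join of each large partition against a full copy of the small table."""
    # "Broadcast": build the lookup once; in Spark every executor would receive a copy.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined_partitions = []
    for part in large_partitions:
        # Each partition is joined where it already lives, so no shuffle is needed.
        joined = [(k, v, s) for k, v in part for s in lookup.get(k, [])]
        joined_partitions.append(joined)
    return joined_partitions


countries = [("IN", "India"), ("BR", "Brazil")]   # small side
events = [                                        # large side, two partitions
    [("IN", 10), ("US", 7)],
    [("BR", 3), ("IN", 5)],
]

joined = broadcast_hash_join(events, countries)
print(joined)
```

In the DataFrame API the small side can be marked with the broadcast function from pyspark.sql.functions, as in large_df.join(broadcast(small_df), "key"), where the DataFrame and column names are hypothetical. Spark also broadcasts automatically when a table is below spark.sql.autoBroadcastJoinThreshold, which is 10 MB by default.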
Right configuration of memory and other resources
Allocating the correct number of cores, executors and memory is very important. If these are not configured properly, the job may run into memory issues that no optimization technique can fix. This is a very big topic in itself. Here's the page for it: How to allocate memory for a Spark job?
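As a starting point, the arithmetic behind a common rule of thumb can be sketched in Python. Everything here is an assumption rather than a Spark requirement: five cores per executor, one core and 1 GB per node left for the operating system, one executor slot kept for the driver or application master, and a 10% memory overhead on top of the heap (which matches Spark's default overhead factor). The function name and the example cluster are made up.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing; all defaults are assumptions to tune."""
    usable_cores = cores_per_node - 1        # leave one core per node for the OS and daemons
    usable_mem_gb = mem_gb_per_node - 1      # leave 1 GB per node for the OS
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node has too few cores for the requested executor size")

    # Keep one slot free for the driver / application master.
    num_executors = nodes * executors_per_node - 1

    # Each slot must hold heap plus overhead: heap * (1 + fraction) <= slot size.
    mem_per_slot_gb = usable_mem_gb / executors_per_node
    heap_gb = int(mem_per_slot_gb / (1 + overhead_fraction))

    return {
        "num_executors": num_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }


# Hypothetical cluster: 10 nodes, 16 cores and 64 GB each.
plan = size_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
print(plan)
```

The three numbers correspond to the spark-submit options --num-executors, --executor-cores and --executor-memory. They are only a first guess; the right values depend on the workload and the cluster manager.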