This is a must-know question in Spark interviews. Understanding resource allocation for a Spark job is very important, because the entire job execution depends on these parameters. If we don't allocate the resources properly, the job might run into issues or starve for resources. In this blog we will learn how to allocate resources for a Spark job: how many executors are needed, how much memory is required for each executor, and how many cores are required for each executor. Below are the factors based on which we need to allocate resources for Spark jobs. They are:
- The size of the data we are going to process
- The time within which the job needs to be completed
- Resource allocation - dynamic or static
- Upstream and downstream applications
Static Resource Allocation
Assume that we have 6 nodes and each node has 16 cores and 64GB RAM.
On each and every node, 1 core and 1 GB of RAM must be reserved for the OS and the background daemon processes that always run. So we are left with 15 cores and 63 GB RAM on each node.
Number of cores:
The number of cores equals the number of concurrent tasks that can run on an executor. But there is a limit on the number of cores: research shows that more than 5 parallel tasks per executor leads to bad Spark job performance. So for the job to run optimally, the number of cores per executor should be 5, even though we have 15 cores available on each node.
Number of executors:
We have 15 cores available on each node and each executor gets 5 cores, so the number of executors a node can host is 15/5 = 3. Across all 6 nodes this comes to 3*6 = 18, because we have 3 executors on each node and 6 nodes. Out of these 18 executors, one must be dedicated to the ApplicationMaster in YARN, which leaves us with 17 executors. So we can pass 17 to the --num-executors parameter of the spark-submit command.
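The executor-count arithmetic above can be sketched as a few lines of code. The cluster specs here are the example values from this post (6 nodes, 16 cores each), not fixed constants:

```python
# Executor-count calculation from the example cluster in this post.
nodes = 6
cores_per_node = 16
reserved_cores = 1          # reserved per node for OS / daemon processes
cores_per_executor = 5      # cap on concurrent tasks per executor

usable_cores = cores_per_node - reserved_cores            # 15
executors_per_node = usable_cores // cores_per_executor   # 15 / 5 = 3
total_executors = executors_per_node * nodes              # 3 * 6 = 18
num_executors = total_executors - 1                       # minus 1 for the YARN ApplicationMaster

print(num_executors)  # 17
```

This is the value that goes to `--num-executors`.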
Memory for each executor:
We have 3 executors on each node and 63 GB RAM available on each node.
So each executor can take 63/3 = 21 GB RAM.
But a small memory overhead is involved in determining the full memory request to YARN.
The formula for the memory overhead is max(384 MB, 0.07 * spark.executor.memory).
Calculating this for 21 GB gives max(384 MB, 1.47 GB).
The maximum of these two is 1.47 GB, so the memory overhead is 1.47 GB.
So the 1.47 GB overhead, rounded up to about 2 GB, must be subtracted from the 21 GB available per executor.
That leaves about 19 GB of memory for each executor, which we can pass to the --executor-memory parameter.
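The memory calculation above, together with the earlier core and executor counts, can be sketched as follows. All numbers are the example values from this post, and the printed flags (`--num-executors`, `--executor-cores`, `--executor-memory`) are the standard spark-submit resource options:

```python
import math

# Per-executor memory calculation from this post (all values in GB).
ram_per_node = 64
reserved_ram = 1                  # reserved per node for OS / daemons
executors_per_node = 3

raw_per_executor = (ram_per_node - reserved_ram) / executors_per_node  # 63 / 3 = 21 GB
overhead = max(0.384, 0.07 * raw_per_executor)                         # max(384 MB, 7%) = 1.47 GB
executor_memory = math.floor(raw_per_executor - overhead)              # 21 - 1.47, rounded down = 19 GB

print(f"--num-executors 17 --executor-cores 5 --executor-memory {executor_memory}G")
```

These are the three flags we would attach to the spark-submit command for this example cluster.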