Explain Spark architecture (or) What happens when you submit a Spark job?

This is one of the most common interview questions. As Spark application developers we must have a clear idea of the architecture of Spark. This blog gives you a complete picture of the Spark run-time architecture.

 

In this blog we will get to know the run-time architecture of Spark along with some key concepts and terminology such as SparkContext, the shell, application submission, task, job and stage, and also the roles of the driver program, the executors and the cluster manager.

Spark architecture is based on two main abstractions. They are:

1). RDD (Resilient Distributed Dataset)

2). DAG (Directed Acyclic Graph)

 

RDD - An RDD is a collection of items divided into partitions. These partitions are stored in the memory of the worker nodes of the Spark cluster. RDDs can be created from files on HDFS or any other storage system supported by Spark, or by parallelizing native Scala collections. An RDD supports two types of operations: 1). Transformations 2). Actions
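For example, here is a minimal sketch in Scala of both ways of creating an RDD and of the two operation types (it assumes an existing SparkContext named sc, as provided by spark-shell; the HDFS path is a placeholder):

  // RDD from a file on HDFS (or any other storage system Spark supports)
  val linesRdd = sc.textFile("hdfs:///data/sample.txt")

  // RDD by parallelizing a native Scala collection into 2 partitions
  val numbersRdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)

  // Transformation: lazily defines a new RDD, nothing is executed yet
  val squaredRdd = numbersRdd.map(n => n * n)

  // Action: triggers the actual computation on the executors
  val total = squaredRdd.reduce(_ + _)   // 55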

 

DAG - A DAG is a graph that records a series of operations as edges and RDDs as nodes. The operations mentioned on the edges are carried out on the corresponding nodes, which are the RDDs.
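The lineage DAG recorded for an RDD can be inspected with toDebugString; a small sketch, again assuming sc from spark-shell and a placeholder input path:

  // Each transformation adds to the lineage DAG; no data is processed yet
  val words  = sc.textFile("hdfs:///data/sample.txt").flatMap(_.split(" "))
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

  // Prints the lineage (the DAG of operations) behind this RDD
  println(counts.toDebugString)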

 

Daemon processes in Spark

Spark has a master-slave architecture built on two main daemon processes:

i). Master Daemon

ii). Worker Daemon

In Spark there is only one master and there can be multiple workers. The driver program is the master of all the executors. The driver and the executors each run in their own JVMs.

 

Before diving deep into the architecture, we should know a few terms and their definitions. They are:

SparkSession/SparkContext -

This is the heart of Spark application execution. Using it we can create the RDDs, accumulators and broadcast variables needed to execute the program. The main duties of the SparkContext are:

  • To keep track of the status of an application
  • To cancel a job
  • To cancel a stage
  • To run jobs synchronously or asynchronously
  • To allocate resources dynamically
  • To access persisted data and to unpersist it
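A short sketch of how some of these duties surface in the public SparkContext API (the application name, master URL and job-group label are only example values):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("architecture-demo")   // example application name
    .master("local[*]")             // example master, for a local run
    .getOrCreate()
  val sc = spark.sparkContext

  // Keep track of the status of the application
  println(sc.statusTracker.getActiveJobIds().mkString(","))

  // Group jobs so they can be cancelled together, or cancel a single stage
  sc.setJobGroup("example-group", "jobs that belong together")
  sc.cancelJobGroup("example-group")
  // sc.cancelStage(stageId)   // cancels one stage, given its id

  // Access persisted data and unpersist it
  val cached = sc.parallelize(1 to 100).cache()
  cached.count()
  cached.unpersist()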

Spark Shell (spark-shell)

This is an application developed in Scala which offers the user an interactive command-line environment with auto-completion of statements. It can be used to get familiar with the functionality available in Spark and to run small sample programs, thereby gaining the knowledge needed to develop a complete standalone job.
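For example, the following lines could be typed at the scala> prompt of spark-shell, which already provides sc (the SparkContext) and spark (the SparkSession):

  // sc is pre-created by spark-shell
  val rdd = sc.parallelize(1 to 10)
  rdd.filter(_ % 2 == 0).collect()   // Array(2, 4, 6, 8, 10)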

Task

A task is a unit of work that runs on an executor. Each stage has a number of tasks, and each task works on one partition; the same task logic runs on all the partitions across all the executors.

Job

A job is a set of tasks that run in parallel. Jobs are launched in response to Spark actions.

Stage

Each job is divided into sets of tasks called stages. Stages depend on each other, i.e., a later stage depends on the former one, so an application is processed in multiple stages.
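A sketch of how actions and shuffles translate into jobs and stages (assuming sc as before; the paths are placeholders):

  val pairs = sc.textFile("hdfs:///data/sample.txt")   // placeholder path
                .flatMap(_.split(" "))
                .map(w => (w, 1))

  // reduceByKey needs a shuffle, so the job that runs it has two stages:
  //   stage 1: read + flatMap + map, one task per input partition
  //   stage 2: reduceByKey on the shuffled data
  val counts = pairs.reduceByKey(_ + _)

  counts.count()                                   // action #1 -> job #1
  counts.saveAsTextFile("hdfs:///out/wordcount")   // action #2 -> job #2 (placeholder path)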

 

Below is the diagram that shows the architecture of Spark:

 

Let's look into each and every component present in the above architecture diagram:

Driver Program

This is the entry point of a Spark job. The main() method of the Spark job is defined in the driver program, and this is where the SparkContext or SparkSession is created. The driver program runs the user code, creates the RDDs and triggers the transformations and actions that are carried out on the executors. It divides the Spark application into stages and schedules them to run on the executors. The task scheduler lives in the driver program and distributes the tasks among the worker nodes, i.e., the executors.
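A minimal driver-program sketch: main() creates the SparkSession/SparkContext, defines the RDDs and transformations, and the action at the end makes the driver schedule stages and tasks on the executors (the object name and the arguments are placeholders):

  import org.apache.spark.sql.SparkSession

  object WordCountDriver {
    def main(args: Array[String]): Unit = {
      // The driver creates the SparkSession/SparkContext
      val spark = SparkSession.builder().appName("WordCountDriver").getOrCreate()
      val sc = spark.sparkContext

      // RDDs and transformations are defined here, in the driver
      val counts = sc.textFile(args(0))
        .flatMap(_.split(" "))
        .map(w => (w, 1))
        .reduceByKey(_ + _)

      // The action triggers scheduling of stages and tasks on the executors
      counts.saveAsTextFile(args(1))

      spark.stop()   // frees the allocated resources
    }
  }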

Cluster Manager

This is the component that allocates the memory and resources the Spark application needs to execute. The resources required for a job can be requested dynamically whenever needed and are freed when they are no longer being used. When Spark runs in local mode there is no separate cluster manager, but local mode is never used in a production environment, so the cluster manager is only optional for local development. There are several cluster managers that work with Spark: 1). Standalone, 2). YARN, 3). Mesos, 4). Kubernetes. Choosing among these cluster managers depends on the final goal of the project, because each of them provides a different set of schedulers and scheduling capabilities.
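The cluster manager is selected through the master URL, usually passed with --master to spark-submit or set on a SparkConf as sketched below (the host names and ports are placeholders; pick exactly one master):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("cluster-manager-demo")
    .setMaster("spark://master-host:7077")            // Standalone
    // .setMaster("yarn")                              // YARN
    // .setMaster("mesos://mesos-host:5050")           // Mesos
    // .setMaster("k8s://https://k8s-apiserver:443")   // Kubernetes
    // .setMaster("local[*]")                          // local mode, no cluster manager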

Executors

The divided tasks are executed on the individual executors in the Spark cluster. Every Spark application has its own executors; they are launched at the beginning of the application and terminated at the end of the execution. This is called static resource allocation. However, we can configure dynamic resource allocation, in which resources are allocated and freed as and when required. Executors are the actual data processors: they read the data to be processed, write the output to storage systems and keep the computed data in memory.
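A sketch of the configuration difference between the two modes (these are standard Spark configuration keys; the numbers are only examples):

  import org.apache.spark.SparkConf

  // Static resource allocation: a fixed set of executors for the whole application
  val staticConf = new SparkConf()
    .set("spark.executor.instances", "4")
    .set("spark.executor.memory", "4g")
    .set("spark.executor.cores", "2")

  // Dynamic resource allocation: executors are requested and released as the load changes
  val dynamicConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "10")
    .set("spark.shuffle.service.enabled", "true")   // usually required with dynamic allocation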

How to launch a Spark program?

We can launch a Spark program using the spark-submit command. We can supply all the required inputs and additional configurations such as the driver memory, the number of cores, etc. in this command. spark-submit launches the driver program, which starts the execution.
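A representative spark-submit invocation (the class name, jar path, input/output paths and resource sizes are placeholders):

  spark-submit \
    --class com.example.WordCountDriver \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 2g \
    --executor-memory 4g \
    --executor-cores 2 \
    --num-executors 4 \
    /path/to/wordcount.jar hdfs:///data/sample.txt hdfs:///out/wordcount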

 

What happens when we submit a Spark Job?

Using the spark-submit command, the user submits the Spark application to the Spark cluster. spark-submit invokes the main() method specified in the command, which launches the driver program. The driver program converts the code into a Directed Acyclic Graph (DAG) that holds all the RDDs and the transformations to be performed on them. During this phase the driver program also performs some optimizations and then converts the DAG into a physical execution plan made up of a set of stages. From this physical plan the driver creates small execution units called tasks, and these tasks are sent to the Spark cluster.

The driver program then talks to the cluster manager and requests the resources needed for execution. The cluster manager launches the executors on the worker nodes. The executors register themselves with the driver program, so the driver has complete knowledge of the executors. The driver program then sends the tasks to the executors and starts the execution, and it keeps monitoring the tasks running on the executors until the job completes. When the job is completed, or when the stop() method is called (for example after a failure), the driver program terminates and frees the allocated resources.
