DAG - Directed Acyclic Graph
A DAG comprises of edges and vertices, in which edges represent rdds and vertices represent operations to be performed on rdds. On calling any action DAG will be submitted to DAGScheduler. Further this job will be divided into stages, where a stage is operations between two shuffles.
How DAGScheduler works in Spark?
i). Scala interpreter works on the code first to create binary code.
ii). Spark creates a graph after compiling the source code.
iii). Whenever an action called, Spark submits this graph to DAGScheduler.
iv). DAG scheduler divides the job into stages.
v). These stages are passed to task scheduler and through cluster manager spark launches the job.
vi). Finally worker nodes will execute the job.
How DAG achieves fault tolerance?
Spark always divides the task into stages. During the execution each executors works on a partition at time. These operations together called a DAG. If there is any failure like node failure, Spark recognizes it. Then it first finds the missing partition, as it has all the steps in DAG which has all the steps of the job, it rebuilds the missing partition. So there won't be any data loss.
How DAG optimizes the Spark Job?
DAG optimizer does this job. DAG optimizer rearranges the operations in such way that processing time will be lesser. For example consider we created a job which has map job followed by filter operation. Then DAG scheduler rearranges filter operation first and then map. So the number of records that will operated in map operation will be lesser as some of the records are already filtered.
What is the difference between DAG and Lineage graph?
This is a common interview question that we face in interviews. DAG maintains the steps to rebuild the partitions and Lineage also does the same. Then what is the difference between these two.
Lineage a set of steps which will be used to rebuild partitions of an RDD. Lineage is confined to RDDs only. Whereas the DAG is a combination of edges and vertices. Vertices represent rdds and edges represent the operations to be performed on them. DAG always divides the task to in stages but rdd will not.