What is the difference between Spark cluster mode and client mode?
Cluster mode vs Client mode in Spark?
What is deployment mode?
The deployment mode specifies where the driver program runs. We can set it when submitting a Spark job with the --deploy-mode argument; the default value is client. Because Spark places the driver according to this setting, it affects the behaviour of the entire application.
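As a rough illustration of how the --deploy-mode argument fits into a submission, here is a minimal Python sketch that only assembles the command line as a list of strings; it does not run Spark. The helper name build_spark_submit and the file name my_job.py are hypothetical, chosen just for this example.

```python
from typing import List, Optional


def build_spark_submit(app: str, master: str,
                       deploy_mode: Optional[str] = None) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper)."""
    args = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        if deploy_mode not in ("client", "cluster"):
            raise ValueError(f"unknown deploy mode: {deploy_mode}")
        args += ["--deploy-mode", deploy_mode]
    # When --deploy-mode is omitted, spark-submit defaults to client mode.
    args.append(app)
    return args


# my_job.py is a placeholder application name.
print(" ".join(build_spark_submit("my_job.py", "yarn")))
print(" ".join(build_spark_submit("my_job.py", "yarn", "cluster")))
```

The first command leaves the flag out and so relies on the client default; the second asks for the driver to be placed on the cluster.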
There are two types of deployment modes in Spark.
i). Client mode
ii). Cluster mode
In client mode, the driver program runs on the same machine from which the job is submitted. The main drawback is that the driver is tied to that machine: if the driver program fails, the entire job fails. Client mode can still use a cluster manager such as YARN to allocate resources. It supports both the interactive shells and normal job submission. Performance can suffer, however, when the client machine is far from the cluster, because the driver and executors communicate throughout the job. For this reason client mode is rarely used for production jobs.
In cluster mode, the driver program does not run on the machine from which the job was submitted; it runs on the cluster. On YARN, it runs inside the ApplicationMaster. The advantage is that YARN can start a new attempt of the ApplicationMaster, and the driver with it, if the driver program fails. Cluster mode is not supported by the interactive shells such as spark-shell. It is the mode typically used in production environments. To use it, we have to submit the Spark job with the spark-submit command.
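The two constraints above (no cluster mode for interactive shells, and driver re-attempts on YARN) can be sketched as a small argument builder. This is only an illustration: launch_args and INTERACTIVE_TOOLS are hypothetical names, and the sketch assumes YARN as the cluster manager. spark.yarn.maxAppAttempts is the Spark-on-YARN setting that caps the number of application attempts; the value 2 is an arbitrary example.

```python
from typing import List

# Interactive front ends that only make sense in client mode (illustrative set).
INTERACTIVE_TOOLS = {"spark-shell", "pyspark"}


def launch_args(tool: str, deploy_mode: str,
                max_app_attempts: int = 2) -> List[str]:
    """Build launcher arguments for YARN (hypothetical helper)."""
    if tool in INTERACTIVE_TOOLS and deploy_mode == "cluster":
        # An interactive shell needs its driver on the local machine.
        raise ValueError("cluster deploy mode is not applicable to interactive shells")
    args = [tool, "--master", "yarn", "--deploy-mode", deploy_mode]
    if deploy_mode == "cluster":
        # Upper bound on how many times YARN will attempt the application.
        args += ["--conf", f"spark.yarn.maxAppAttempts={max_app_attempts}"]
    return args


print(launch_args("spark-submit", "cluster"))
print(launch_args("spark-shell", "client"))
```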
The diagram below shows the cluster mode architecture:
In cluster mode, a cluster manager is required to allocate resources for the job. The following cluster managers are available:
1). Standalone - a simple cluster manager included with Spark that makes it easy to set up a cluster.
2). Apache Mesos - a general cluster manager that can also run Hadoop MapReduce.
3). YARN - the resource manager in Hadoop 2 and later.
4). Kubernetes - an open source system for automating the deployment, scaling and management of containerized applications.
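Each of these cluster managers is selected through the value passed to --master. The sketch below maps the four managers to the master URL formats Spark accepts; master_url is a hypothetical helper, and the host names and ports in the usage lines are placeholders (7077 and 5050 are the usual defaults for Standalone and Mesos, assumed here for illustration).

```python
# Master URL templates for the cluster managers listed above.
MASTER_URL_FORMATS = {
    "standalone": "spark://{host}:{port}",
    "mesos": "mesos://{host}:{port}",
    "yarn": "yarn",  # the cluster location comes from the Hadoop configuration
    "kubernetes": "k8s://https://{host}:{port}",
}


def master_url(manager: str, host: str = "", port: int = 0) -> str:
    """Return the --master value for a cluster manager (hypothetical helper)."""
    try:
        template = MASTER_URL_FORMATS[manager]
    except KeyError:
        raise ValueError(f"unknown cluster manager: {manager}") from None
    # Templates without placeholders (yarn) ignore host and port.
    return template.format(host=host, port=port)


# Placeholder hosts and ports, for illustration only.
print(master_url("standalone", "master-host", 7077))
print(master_url("mesos", "mesos-host", 5050))
print(master_url("yarn"))
print(master_url("kubernetes", "k8s-apiserver", 6443))
```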
In local mode, the driver program and the executor run in a single JVM on a single machine. This mode is useful for development, unit testing and debugging Spark jobs. It has several limitations: resources are limited, the chance of running out of memory is high, and it cannot be scaled out. In addition, Spark does not re-run failed tasks in this mode by default, although we can override this behaviour. The Spark UI is available at localhost:4040 in this mode by default.
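Local mode is selected with a master value of the form local[K], and the task retry behaviour mentioned above can be changed with the local[K,F] form, where F is the maximum number of task failures. The following sketch just builds those strings; local_master is a hypothetical helper name.

```python
from typing import Optional


def local_master(threads: Optional[int] = None,
                 max_failures: Optional[int] = None) -> str:
    """Build a local-mode master string (hypothetical helper).

    threads=None maps to '*', meaning one worker thread per logical core.
    """
    if threads is not None and threads < 1:
        raise ValueError("threads must be at least 1")
    workers = "*" if threads is None else str(threads)
    if max_failures is None:
        return f"local[{workers}]"
    # The second field sets how many failures of a task are tolerated.
    return f"local[{workers},{max_failures}]"


print(local_master())      # all cores, default retry behaviour
print(local_master(2))     # two worker threads
print(local_master(2, 4))  # two worker threads, up to 4 task failures
```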
When to use Cluster mode?
If the client machine is far from the actual cluster, we should use cluster mode. If we launched the job in client mode, communication between the executors and the driver would suffer from network latency, and if the client machine went offline we would lose the entire application.
When to use Client mode?
Client mode can be used when the client machine is located within the cluster. In this case network latency is not a concern, and because machines inside the cluster are maintained with the utmost care, failures of the client machine are less of a worry as well.
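The guidance in these two sections can be condensed into a small decision helper. This is only a sketch of the rule of thumb described in this article, not a Spark API; choose_deploy_mode and its parameters are hypothetical names.

```python
def choose_deploy_mode(client_near_cluster: bool,
                       interactive: bool = False) -> str:
    """Pick a deploy mode from the rules of thumb above (hypothetical helper)."""
    if interactive:
        # Interactive shells can only run in client mode.
        return "client"
    # A client inside (or close to) the cluster avoids driver-executor latency;
    # a distant client is better served by running the driver on the cluster.
    return "client" if client_near_cluster else "cluster"


print(choose_deploy_mode(client_near_cluster=True))
print(choose_deploy_mode(client_near_cluster=False))
print(choose_deploy_mode(client_near_cluster=False, interactive=True))
```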