One of the most commonly asked interview questions. If you are a mid-level experienced professional, this is practically a compulsory question. In this blog we will look at these two abstractions in detail, and you will also pick up a few details that are not readily available on commonly used forums like Stack Overflow.
In Spark version 1.x, SparkContext is the entry point for our Spark applications, and to build it we first need a SparkConf object. All our configurations are set on the SparkConf object, which is then used to create the SparkContext, as shown below:
val conf = new SparkConf().setAppName("MyApplication").setMaster("local")
val sparkContext = new SparkContext(conf)
Execution of the entire application starts from the SparkContext created above.
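For instance, here is a minimal sketch (reusing the sparkContext created above, with a local master) of how the SparkContext drives RDD creation and execution:

// Sketch only: create an RDD from a local collection and run an action on it.
val numbers = sparkContext.parallelize(1 to 100)
val evenSum = numbers.filter(_ % 2 == 0).reduce(_ + _)   // the action triggers execution
println(s"Sum of even numbers: $evenSum")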
In Spark version 2.x, on the other hand, SparkSession is the entry point, and it is quite different from SparkContext. Using this SparkSession we create DataFrames and Datasets and accomplish our tasks. A typical SparkSession creation snippet looks like the following:
val spark = SparkSession
  .builder()
  .master("local")
  .appName("WorldBankIndex")
  .getOrCreate()
A SparkContext object is also created while creating the SparkSession object, and it can be accessed from the SparkSession created above, as shown below:
val context = spark.sparkContext
In addition, the SparkSession object by default also contains the SQLContext object.
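A small sketch of that (reusing the spark session from above; the one-row query is just for illustration):

val sqlContext = spark.sqlContext          // the pre-2.x SQLContext, exposed by the session
val df = spark.sql("SELECT 1 AS id")       // spark.sql(...) is the usual entry point now
df.show()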
This is not the complete answer to our question; it is the answer a Spark novice would give. If you are an experienced professional, the interviewer will expect something more. If you stop with the above answer and the interviewer is sharp, you will be in trouble, because the next question will be: why was it designed this way? Let us look at that now.
If you observe, two different SparkSessions created in an application will share the same SparkContext object. Try creating two different SparkSessions, getting the SparkContext objects out of them, and comparing their hash codes: they will be identical. This is because having multiple SparkContext objects in a single application can lead to unexpected behavior. The failure of one SparkContext can also cause the failure of another, which can ultimately bring the JVM down. There is no guarantee that a pipeline will run properly when an application holds multiple SparkContext objects.
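A quick way to verify this (assuming the spark session created earlier; newSession() gives a second, distinct session):

val anotherSession = spark.newSession()

println(spark == anotherSession)                                                    // false: two different sessions
println(spark.sparkContext.hashCode() == anotherSession.sparkContext.hashCode())    // true: one shared SparkContext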
Another reason for having a single SparkContext object and multiple SparkSessions is supporting multiple users processing data in the same application. In Spark version 1.6, if multiple users need to operate on datasets that must be isolated from each other, we have to create multiple SparkContext objects, which is not acceptable from a performance perspective. In Spark version 2.x, the same situation is handled by creating multiple SparkSessions on top of a single SparkContext. This is exactly the situation SparkSession was created to handle.
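As a sketch (assuming the spark session from above; the view name and row counts are made up for illustration), each user can get an isolated session via newSession(), with its own temporary views and SQL configuration but the same shared SparkContext:

val userA = spark.newSession()
val userB = spark.newSession()

// Each session keeps its own catalog of temporary views.
userA.range(5).createOrReplaceTempView("my_data")
userB.range(50).createOrReplaceTempView("my_data")

println(userA.sql("SELECT COUNT(*) FROM my_data").first().getLong(0))   // 5
println(userB.sql("SELECT COUNT(*) FROM my_data").first().getLong(0))   // 50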
Please comment your thoughts about this post.