Miscellaneous Spark interview questions – Part I

1). Spark jargon - Job

A Spark application is broken down into a number of jobs. A new job is submitted for every action in the application, and also for certain eager operations such as reading a file with schema inference. Every Spark job has at least one stage, and a job can have any number of stages depending on how many shuffle boundaries the computation crosses.
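
As a minimal sketch (the file path is only a placeholder), each action below triggers its own job, which you can confirm in the Jobs tab of the Spark UI:

// Reading a CSV with inferSchema = true runs a small job just to scan the file.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // triggers a job to infer column types
  .csv("/tmp/sample.csv")          // placeholder path

df.count()   // one job
df.show(5)   // another job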

 

2). Spark jargon - Task

A task is the smallest unit of work inside a stage, and a stage can have any number of tasks. The number of tasks you see in a stage equals the number of partitions that stage processes: every task runs the same computation, but on a different partition of the data.

1 Task = 1 Partition = 1 Slot = 1 Core
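
A small sketch of this relationship (the numbers are illustrative): the number of partitions you create is the number of tasks you will see in the corresponding stage.

// Parallelize with an explicit number of partitions.
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

println(rdd.getNumPartitions)  // 8 partitions ...
rdd.count()                    // ... so the resulting stage runs 8 tasks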

 

3). How to check if an array has duplicates (or) How to check whether a DataFrame has duplicate rows.

Most Scala collections, such as List and Vector, come with a built-in distinct method that returns only the unique elements, and the DataFrame API offers the same method. To check whether a collection or a DataFrame has duplicates, count the elements, apply distinct, count again, and compare the two counts. If the counts differ, the collection or DataFrame contains duplicates.

Below is sample code to check this on a DataFrame:

// toDF on a local Seq requires the SparkSession implicits in scope
import spark.implicits._

val dupliDF = Seq(4, 9, 1, 2, 7, 3, 1, 5, 2, 3).toDF("id")
val count_before = dupliDF.count()

val uniqDF = dupliDF.distinct()
val count_after = uniqDF.count()

// If the counts differ, the DataFrame contains duplicate rows
if (count_before != count_after) {
  println("DataFrame has duplicates")
} else {
  println("No duplicates found")
}
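
The same pattern works on a plain Scala collection; here is a small sketch using a List (the values are arbitrary):

val nums = List(4, 9, 1, 2, 7, 3, 1, 5, 2, 3)

// Compare sizes before and after distinct.
if (nums.size != nums.distinct.size)
  println("Collection has duplicates")
else
  println("No duplicates found")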

 

4). What is the default storage level when calling the cache() method in Spark?

As of Spark 3.x, the default storage level differs between the APIs: cache() on an RDD uses MEMORY_ONLY, while cache() on a Dataset/DataFrame uses MEMORY_AND_DISK.
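
A small sketch to verify the defaults at runtime (assuming an active SparkSession named spark):

// RDD: cache() defaults to MEMORY_ONLY
val rdd = spark.sparkContext.parallelize(1 to 100)
rdd.cache()
println(rdd.getStorageLevel)   // memory only

// Dataset/DataFrame: cache() defaults to MEMORY_AND_DISK
val ds = spark.range(100)
ds.cache()
println(ds.storageLevel)       // memory and disk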

 

5). What is the value that we need to provide to the parameter "--master" in Spark Standalone mode?

In Spark Standalone mode the --master argument takes the URL of the standalone master, in the form spark://HOST:PORT (for example, spark://192.168.1.10:7077, where 7077 is the default master port).
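
As a sketch, the same master URL can also be set programmatically when building the SparkSession (the host name here is a placeholder):

import org.apache.spark.sql.SparkSession

// Point the application at a standalone master; host and port are placeholders.
val spark = SparkSession.builder()
  .appName("standalone-example")
  .master("spark://spark-master-host:7077")
  .getOrCreate()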

 

6). What is the thumb rule for setting the number of partitions in a Spark application? (Or) How can we decide the number of partitions required in a Spark application?

The number of partitions in a Spark application should be larger than the total number of executor cores, and ideally a small multiple of that number (commonly 2-3x), depending on the workload, so that every core stays busy and uneven partitions are smoothed out.
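
A minimal sketch of applying this rule (the executor and core counts are illustrative, and inputDF is a placeholder DataFrame):

// Assume 4 executors with 4 cores each => 16 slots in total.
val totalCores = 4 * 4
val targetPartitions = totalCores * 3   // roughly 3x the core count

// Apply it to shuffles and to an explicit repartition.
spark.conf.set("spark.sql.shuffle.partitions", targetPartitions.toString)
val repartitionedDF = inputDF.repartition(targetPartitions)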

 

7). What happens when we call the "persist()" method on an RDD with the storage level "MEMORY_ONLY" but the available memory is less than the size of the RDD? How will the Spark framework handle this situation?

Since we called persist() with the MEMORY_ONLY storage level, we are restricting storage to memory only. When the RDD does not fit, Spark caches only the partitions that fit and skips the rest; the uncached partitions are recomputed from the lineage whenever they are needed. Nothing is spilled to disk with this storage level.
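
A minimal sketch of the scenario (the dataset size and partition count are illustrative):

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY: partitions that do not fit in memory are not cached at all,
// and are recomputed from the lineage when a later action needs them.
val bigRDD = spark.sparkContext.parallelize(1 to 100000000, numSlices = 100)
bigRDD.persist(StorageLevel.MEMORY_ONLY)

bigRDD.count()   // caches whatever partitions fit; the rest are dropped
bigRDD.count()   // uncached partitions are recomputed here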

 

8). How to add a unique index to each row of a DataFrame in Spark Scala?

Spark provides a built-in indexing function, monotonically_increasing_id, to achieve this. The generated IDs are guaranteed to be unique and monotonically increasing, but not necessarily consecutive when the DataFrame has multiple partitions. Below is a code snippet that shows how to use this function.

scala> val data = Seq("AAA","BBB","CCC","DDD","EEE").toDF("names")
data: org.apache.spark.sql.DataFrame = [names: string]

scala> data.withColumn("id",monotonically_increasing_id).show
+-----+---+
|names| id|
+-----+---+
|  AAA|  0|
|  BBB|  1|
|  CCC|  2|
|  DDD|  3|
|  EEE|  4|
+-----+---+

 

9). Does the cache operation come under lazy evaluation or eager evaluation?

The cache() operation is lazy. When we call cache() on an RDD or DataFrame, Spark does not evaluate or store anything at that point; it only marks the object for caching. When an action is subsequently called, the data is computed and stored according to the storage level provided.
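
A small sketch illustrating the laziness:

val df = spark.range(1, 1000000).toDF("id")

df.cache()    // lazy: only marks the DataFrame for caching, nothing is stored yet
df.count()    // first action: computes the data and populates the cache
df.count()    // later actions read from the cached data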

 
