Even though Spark is very faster compared to Hadoop, Spark 1.6x has some performance issues which are corrected in Spark 2.x. And this is most commonly asked interview question. Let's take a look at the features of Spark 2.x.
- SparkSession - A new entry point for Spark application execution is introduced. There are several reasons for introducing it. Some of them are: we need to create separately sql context, hive context if we have only SparkContext. To know more details about SparkSession click here.
- Faster analysis - Spark 1.x uses compilers which uses of several function calls and CPU cycles, because of which so much unnecessary work spent on CPU cycles. Spark 2.x uses performance enhanced Tungsten engine. This new version of Tungsten project takes the query plan and collapses it into a single function by avoiding unnecessary function calls. It uses CPU registers to store intermediate instead using CPU memory as that of Spark 1.x. This will improve the performance by 10 time at least.
- Added SQL features - SQL in Spark 2.x added with more functionalities along with support for SQL2003. Spark SQL in 2.x version can run all of the 99 TPC-DS queries. When creating SparkSession without Hive support will have all the features when it create with Hive support. Native SQL parser which supports both ANSI - SQL and Hive QL. Native DDL commands are implemented.
- MLlib improvements - RDD based API is going into maintenance mode and DataFrame based has become the primary API now. Many machine learning algorithms like Gaussian Mixture Model, MaxAbsScaler, Bisecting K-Means clustering feature transformer are added to DataFrame based API and many ML algorithms added to PySpark and SparkR also.
- New Streaming module - In Spark 2.x a new high level streaming API built on Spark SQL and Catalyst Optimizer has been added. It is added as a separate module in Spark release with the name Structured Streaming. Using this users can connect to any streaming source and sink leveraging the optimization techniques which come with DataFrame API. And for DataFrame API support for connection with Kafka has been added but it is experimental stage.
- Unified Dataset and DataFrame APIs - DataFrame is a high level API for structured data introduced since 1.3 version of Spark. This has lot performance optimizations compared to RDD. Later in 1.6 version of Spark new API for structured data is introduced, DataSet. Dataset provide type safety compared DataFrames. In Spark 2.x onwards these two are unified to form new single API. Now Dataframe is just an alias for Dataset of Row. Datasets will have all the typed and untyped methods.
- A native CSV data source introduced based Databricks' csv data source.
- Improved parquet file throughput by using vectorization.