What is Spark SQL?

Spark SQL is one of the modules available in Spark. It runs on top of the Spark Core module and handles structured data processing. It lets the user combine relational and procedural processing through two abstractions, DataFrame and Dataset, which are the main way to interact with Spark SQL. It allows the user to import relational data from Hive tables and from structured file formats such as Parquet, Avro, and Sequence files, run SQL queries on top of them, and store the results back into files.


Spark SQL components

SQLContext - This is the entry point for working with structured data in Spark SQL. It creates DataFrames, the structured data abstraction, and lets the user execute SQL queries.


HiveContext - It is similar to SQLContext but also lets the user interact with Hive tables. It offers more functionality than SQLContext.


JDBC Data Source - The JDBC data source lets the user read data from relational databases through the JDBC API. It reads structured data from a database and returns it as a DataFrame, so the data can then be processed with Spark's distributed engine instead of inside the database alone.


Catalyst Optimizer - The most powerful and technically involved component in Spark SQL. Catalyst is a query optimizer that optimizes the entire execution plan. It uses both rule-based and cost-based optimization, which is what lets DataFrames and Datasets often run more efficiently than equivalent hand-written RDD code.


Features of Spark SQL

Integrated - Spark SQL works with structured data through the DataFrame and Dataset APIs, which makes it more user friendly. It is also available in several programming languages, including Scala, Java, Python, and R.

Uniformity in accessing the data - Spark DataFrames can be created from several data sources, such as Hive tables, Sequence files, Parquet files, Avro files, JSON, and JDBC connections.

Compatibility with Hive - Spark SQL lets the user work directly with Hive tables. Users can execute Hive queries and use Hive UDFs.

Connectivity - Spark SQL supports standard JDBC and ODBC connectivity, so external tools can connect to it, and it can read from databases that provide a JDBC driver.

Performance and Scalability - It delivers good performance through the Catalyst optimizer and scales out across many nodes because it runs on cluster computing.
