The Catalyst Optimizer is one of the most common topics in Spark interviews, and there is a lot to it. In this blog I have tried my best to cover it.
What is Catalyst Optimizer?
Spark SQL is one of Spark's core components. It lets users interact with data in a SQL fashion: you write SQL queries and analyse the data. Since Spark deals with huge volumes of data, there should be an optimized way to perform operations on that data. Optimization means accomplishing a job more efficiently than it would otherwise be done. At the core of Spark SQL lies the Catalyst Optimizer. Catalyst makes use of advanced programming language features to build optimized queries, and it was developed in Scala. It supports both rule-based optimization and cost-based optimization. In rule-based optimization, the optimizer determines how to execute the query by applying a set of rules. In cost-based optimization, the optimizer develops multiple candidate plans, calculates the cost of each, and then decides on the cheapest way to complete the operation.
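To make the distinction concrete, here is a toy sketch (illustrative only, not Catalyst's real API): a rule-based choice fires whenever a pattern matches, while a cost-based choice compares estimated costs of candidate plans. The plan names and cost numbers are made up for illustration.

```python
# Rule-based: a rewrite rule fires whenever its pattern matches,
# e.g. "x + 0" can always be simplified to just "x".
def simplify_add_zero(expr):
    op, left, right = expr
    if op == "+" and right == 0:
        return left
    return expr

# Cost-based: enumerate candidate plans, estimate a cost for each,
# and keep the cheapest one (costs here are hypothetical units).
candidate_plans = {
    "broadcast_join": 10,
    "shuffle_join": 250,
}
best_plan = min(candidate_plans, key=candidate_plans.get)

print(simplify_add_zero(("+", "x", 0)))  # -> x
print(best_plan)                         # -> broadcast_join
```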
Fundamentals of Catalyst Optimizer
Catalyst makes use of standard features of the Scala programming language, such as pattern matching. It is built around trees and sets of rules that manipulate those trees.
Trees - The main data type used in the Catalyst Optimizer. A tree is a group of nodes, and a node can have zero or more child nodes. Nodes are defined as subclasses of the TreeNode class, and these objects are immutable: they are manipulated using functional transformations.
Rules - Rules are used to manipulate trees. A rule can be defined as a function from one tree to another.
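The two ideas can be sketched together in a few lines. This is a hedged, minimal illustration with invented names (`Node`, `transform`, `drop_identity`), not Spark's actual classes: nodes are immutable, and a rule is just a function from one tree to another, applied bottom-up.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)          # frozen=True makes nodes immutable
class Node:
    name: str
    children: Tuple["Node", ...] = ()

def transform(node: Node, rule) -> Node:
    # Rebuild the children first (a functional transformation that
    # returns new nodes), then apply the rule at this node.
    new_children = tuple(transform(c, rule) for c in node.children)
    return rule(Node(node.name, new_children))

# A rule that removes no-op "Identity" nodes, replacing them with their child.
def drop_identity(node: Node) -> Node:
    if node.name == "Identity" and len(node.children) == 1:
        return node.children[0]
    return node

tree = Node("Project", (Node("Identity", (Node("Scan"),)),))
print(transform(tree, drop_identity))  # Project sits directly over Scan
```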
A Spark SQL query is executed in four phases:
- Analysis
- Logical Optimization
- Physical Planning
- Code Generation
Analysis - Spark SQL optimization starts from the relation to be processed, which can come either from an abstract syntax tree returned by the SQL parser or from a DataFrame object. Catalyst first constructs an unresolved logical plan and then applies the steps below:
- Look up the relation by its name in the catalog
- Map the named attributes to the relation's columns
- Determine which attributes refer to the same value and give them a unique ID
- Propagate and coerce types through expressions
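The first three steps above can be sketched as follows. This is a loose illustration under stated assumptions (a hypothetical `catalog` dictionary and `analyze` function, not Spark's real analyzer): the relation is resolved from the catalog, each attribute is checked against it, and every resolved attribute receives a distinct ID.

```python
from itertools import count

catalog = {"people": ["id", "name", "age"]}   # hypothetical catalog
_ids = count(1)                               # source of unique attribute IDs

def analyze(table, attributes):
    # Step 1: resolve the relation by name from the catalog.
    if table not in catalog:
        raise ValueError(f"unresolved relation: {table}")
    resolved = {}
    for attr in attributes:
        # Step 2: map each named attribute to a column of the relation.
        if attr not in catalog[table]:
            raise ValueError(f"unresolved attribute: {attr}")
        # Step 3: assign a unique ID so same-named attributes stay distinct.
        resolved[attr] = next(_ids)
    return resolved

print(analyze("people", ["name", "age"]))
```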
Logical Optimization - In this phase, rule-based optimizations are applied to the logical plan. These include constant folding, predicate pushdown, projection pruning, and other rules. The design makes it easy to add a new rule for various situations.
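Constant folding is the easiest of these rules to demonstrate. Below is a toy version (not Catalyst's implementation): if both operands of an arithmetic node are literals, evaluate the subtree once at optimization time instead of once per row at execution time. `"col_x"` stands in for a column reference that cannot be folded.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def constant_fold(expr):
    if not isinstance(expr, tuple):
        return expr                      # a literal or a column reference
    op, left, right = expr
    left, right = constant_fold(left), constant_fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)      # fold the whole subtree to a literal
    return (op, left, right)

print(constant_fold(("+", "col_x", ("*", 2, 3))))  # -> ('+', 'col_x', 6)
```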
Physical Planning - In this phase, one or more physical plans are created from the logical plan, using physical operators that match the Spark execution engine. A cost model then selects the right plan. Cost-based optimization is used to select join operations: for small relations, Spark uses broadcast joins. The framework supports broader use of cost-based optimization, since the cost of a whole tree can be estimated by applying the model recursively. Rule-based optimizations, such as pipelining projections or filters into a single Spark map operation, are also applied in the physical planning phase.
Code Generation - This is the final phase of execution, in which Java bytecode is generated to run on each machine. Catalyst uses a special feature of Scala called quasiquotes to generate this bytecode. Quasiquotes allow the programmatic construction of abstract syntax trees in Scala, which can then be fed to the Scala compiler to generate bytecode. Catalyst transforms a tree representing an expression in SQL into an AST for Scala code that evaluates the expression, and then compiles and runs the generated bytecode.
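Quasiquotes are Scala-specific, but the underlying idea can be shown with a loose Python analogy (my own sketch, not how Spark does it): walk the expression tree once to generate source code, compile it, and reuse the compiled function for every row instead of re-interpreting the tree.

```python
def generate(expr):
    # Turn an expression tree into source code text.
    if not isinstance(expr, tuple):
        return repr(expr) if isinstance(expr, (int, float)) else str(expr)
    op, left, right = expr
    return f"({generate(left)} {op} {generate(right)})"

tree = ("+", "x", ("*", 2, 3))
source = generate(tree)          # "(x + (2 * 3))"

# Compile the generated source once into a reusable function of x.
func = eval(compile(f"lambda x: {source}", "<generated>", "eval"))

print(source)      # (x + (2 * 3))
print(func(10))    # 16
```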