Big data developers should have some knowledge of the internal workings of the components they use. That is how we learn how each component behaves and how to design our code accordingly, without running into performance glitches. So in this blog we will look into the Hive architecture.
Below is the overall architecture of Hive. If you can understand this architecture diagram, you can easily remember how Hive works.
This is where the end user interacts with Hive to process the data. Hive offers several ways to interact with it, such as the Web UI and the Hive CLI, which come built in with the Hive package. Apart from these, we can use a Thrift client, a JDBC client, or an ODBC client. Hive also provides services like the Hive CLI, Beeline, etc.
HiveQL Process Engine (or) Hive Compiler
The Hive compiler parses the query supplied through the user interface. It checks for syntax and semantic correctness using the metadata stored in the metastore. Finally, it creates an execution plan in the form of a DAG (Directed Acyclic Graph), where each stage is a MapReduce job.
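You can look at the plan the compiler produces for any query using `EXPLAIN` (the `sales` table here is just an illustrative name):

```sql
-- Show the stage DAG that Hive's compiler builds for a query
-- (`sales` is a hypothetical table used for illustration)
EXPLAIN
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
-- The output lists the stages and their dependencies,
-- i.e. the DAG described above.
```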
The Execution Engine is where the actual processing of the data starts. After the compiler checks the syntax, it performs optimizations on the execution plan. Finally, this execution plan is handed over to the Execution Engine. Several execution engines can be used with Hive. MapReduce is one of them, but it is slower compared to the other engines; we can change the execution engine to Tez or Spark. To change the execution engine we can use the below command:
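For example, switching the engine for the current session is done through the `hive.execution.engine` property:

```sql
-- Check which engine is currently configured
SET hive.execution.engine;

-- Switch to Tez for this session
SET hive.execution.engine=tez;

-- Or switch to Spark (requires Hive on Spark to be set up)
SET hive.execution.engine=spark;
```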
The metastore is the central repository where the metadata about tables is stored. This metadata includes database names, table names, column details along with their data types, and table partition details. It also stores the serialization and deserialization (SerDe) details of the files stored in the underlying storage system. In general, the metastore is a relational database. The metastore provides a Thrift server to interact with it.
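Much of this metadata can be inspected directly from HiveQL; for example (the table name `employees` is hypothetical):

```sql
-- List the databases and tables known to the metastore
SHOW DATABASES;
SHOW TABLES;

-- Column types, storage location, SerDe, and other details for one table
DESCRIBE FORMATTED employees;

-- Partition details stored in the metastore
SHOW PARTITIONS employees;
```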
The metastore can be used in two modes:
Remote mode: In this mode the metastore runs as a Thrift service, which is useful for non-Java applications.
Embedded mode: In this mode the client interacts directly with the metastore using JDBC.
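As a sketch, remote mode is typically enabled by pointing clients at the metastore's Thrift endpoint in `hive-site.xml` (the host and port below are placeholders); when this property is left unset, Hive falls back to the embedded JDBC path:

```xml
<!-- hive-site.xml: remote metastore (Thrift service) -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
```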
Hive itself cannot store the data directly. Hive only processes the data and inserts it into tables; the actual data is stored in storage systems like HDFS, HBase, or S3. Hive tables point to the location where the data is stored in one of these storage systems, and the data is retrieved from that location.
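For instance, an external table simply points Hive at data that already lives in HDFS (the path and columns below are illustrative):

```sql
-- The data stays in HDFS; Hive only records the schema and location
CREATE EXTERNAL TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///data/page_views';

-- Dropping an external table removes only the metadata, not the files
DROP TABLE page_views;
```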
What happens when the user hits Hive with a query?
Step 1: First, the user submits a query from any of the user interfaces mentioned above.
Step 2: The Hive compiler checks the syntax and looks up the metadata about the tables in the metastore. The compiler then creates an execution plan, optimizes it, and produces a DAG.
Step 3: The execution plan is executed by the execution engine configured in Hive. While executing the plan, Hive reads the required data from the storage layer.
Step 4: Finally, Hive returns the results to the user interface.
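Putting the four steps together with a single (hypothetical) query:

```sql
-- Step 1: submitted from Beeline/CLI/JDBC (user interface)
SELECT dept, COUNT(*) AS emp_count
FROM employees      -- Step 2: compiler validates against metastore metadata,
GROUP BY dept;      --         then builds and optimizes the DAG
-- Step 3: the configured engine (MapReduce/Tez/Spark) runs the DAG,
--         reading the table's files from HDFS/HBase/S3 (storage layer)
-- Step 4: the results are returned to the client
```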
Hope you enjoyed it. Please comment your thoughts.