How to process JSON data or file in HIVE without using JsonSerDe?

It is very rare that the usage of HIVE with JSON. But sometimes business requirements might force the developers to use HIVE with JSON as underlying storage format. Hive has support for several file formats without the need of using any SerDe jars, just by specifying the storage format in the create table command. But for JSON we can't specfy that in create table command. We have alternate methods to do it. One of them is, we can use a JsonSerDe as "ROW FORMAT SERDE" and another is storing JSON fiie as normal text and process it using JSON in-built methods available in HIVE. Using SerDe it is the same way of dealing with other fiel formats. So let's see how we can do this without using the SerDe.

So the first step to store raw JSCON text in HIVE is to create a table that contain only one column with String datatype, which stores each line from JSON as one row.
Create table command:

(data String)
LOCATION "/user/hive/bigdata";

Now load the JSDN file into table by using LOAD command:

LOAD DATA INPATH "user/hive/date/orders.json" INTO TABLE data;

After loading the data below is the output of the query "select * from data":

|                     |
|{"order_id": "123", "order_status_history": [{"status": "Created", "time": "2020-05-01 00:00:00"}, {"status": "Shipped", "time": "2020-05-03 00:00:00"},{"status": "Delivered", "time": "2020-05-04 00:00:00"} ]}|
|{"order_id": "455", "order_status_history": [{"status": "Created", "time": "2020-05-01 00:00:00"}, {"status": "Shipped", "time": "2020-05-03 00:00:00"}]}|
|{"order_id": "567", "order_status_history": [{"status": "Created", "time": "2020-05-01 00:00:00"} ]}|


In the above output, we can observe that the table has only one column and each row is a line from JSON file.

Now we can access any of the columns from JSON file using in-built JSON utility methods available in HIVE.

If we want to access the column order_id, then we can write query as shown below:

select get_json_object (data,'$.order_id') from data;
| _c0  |
|  123 |
|  455 |
|  567 |

If we want to access the column order_status_history:

select get_json_object (data,'$.order_status_history') from data;
|                       _c0                          |
| [{"status":"Created","time":"2020-05-01 00:00:00"},{"status":"Shipped","time":"2020-05-03 00:00:00"},{"status":"Delivered","time":"2020-05-04 00:00:00"}] |
| [{"status":"Created","time":"2020-05-01 00:00:00"},{"status":"Shipped","time":"2020-05-03 00:00:00"}] |
| [{"status":"Created","time":"2020-05-01 00:00:00"}] |

In the above output we see that the column order_status_history has sub-columns. So to access one of the sub-columns time from the order_status_history we can write a query like below:

select get_json_object (data,'$.order_status_history.time') from data;
|                         _c0                        |
| ["2020-05-01 00:00:00","2020-05-03 00:00:00","2020-05-04 00:00:00"] |
| ["2020-05-01 00:00:00","2020-05-03 00:00:00"]      |
| ["2020-05-01 00:00:00"]                            |

To get status column from order_status_history column:

select get_json_object (data,'$.order_status_history.status') from data;
|                _c0                 |
| ["Created","Shipped","Delivered"]  |
| ["Created","Shipped"]              |
| ["Created"]                        |

Leave a Reply

Your email address will not be published. Required fields are marked *