Process and analyze JSON documents – Apache Hive

A JSON file is a file in the JavaScript Object Notation (JSON) format that stores simple data structures and objects, which is a standard format for data interchange. It is used mainly for the transmission of data between a web application and a server.

Though the JSON formatted data is is easy for machines to parse and generate, traditional relational database systems didn’t have proper functionalities to parse and convert into tabular format.  But this is past. Now, in recent times, most of the relational systems introduced features that can read and convert the JSON data. In Apache Hive, this can be done in two ways.

Although it is simple for machines to parse and generate the JSON formatted content, conventional relational database systems did not have proper features to parse and convert to tabular format. But it’s in the past. Now, most relational systems have recently added features that can read and transform JSON data.

In Apache Hive, the JSON documents can be handled in many ways. Either you can use custom SerDe (Serialization & Deserialization) or you can simply use a user defined function “Get_Json_Object” to perform the task or you can write your own UDF or you can use Json_tuple UDF.

This article focuses on the use of Get_Json_Object function in Apache Hive.

Below is a sample data.
{“name”:”John”, “age”:29, “loc”:”Dubai”}
{“name”:”Smith”, “age”:24}
{“name”:”Alice”}
{“age”:19, “name”:”MIller”}
{“age”:25, “loc”:”Sharjah”}

Let’s create the table and load the data.

CREATE TABLE JsonTest(Json String);
Here Json is a column name, you can name as per your requirement.

LOAD DATA LOCAL INPATH ‘Desktop/jsondata2.json’ INTO TABLE JsonTest;

Verify the data
SELECT * FROM JsonTest;

createtable_loaddata_jsontest

The above implementation is required for the JSON document to be composed of a single row. It is mandatory to flatten the JSON document to a string. The below step helps in converting the Json data into tabular format.

SELECT
GET_JSON_OBJECT(json,’$.name’),
GET_JSON_OBJECT(json,’$.age’),
GET_JSON_OBJECT(json,’$.loc’)
FROM jsontest;

Get_Json_Object

Another Example:
Dataset2:
{“id”:121,”status”:”initiated”,”date”:1563981100},
{“id”:121,”status”:”followup”,”date”:1563981111},
{“id”:121,”status”:”success”,”date”:1563981141},
{“id”:122,”status”:”initiated”,”date”:1563986100},
{“id”:122,”status”:”followup”,”date”:1563986111},
{“id”:122,”status”:”success”,”date”:1563986141},
{“id”:123,”status”:”initiated”,”date”:1563996130},
{“id”:123,”status”:”followup”,”date”:1563996150},
{“id”:123,”status”:”failed”,”date”:1563996166}

–Creating a table to hold the in a single string.
CREATE TABLE jsontest2(Json STRING);

–Loading the data
LOAD DATA LOCAL INPATH ‘Desktop/my_record1.json’ INTO TABLE jsontest2;

–Converting the Json data into columns
SELECT get_json_object(json, ‘$.id’) AS id,
get_json_object(json, ‘$.status’) AS status,
get_json_object(json, ‘$.date’) AS date FROM jsontest2;

Get_Json_Object2

Hope you like this post.

Please do click on the follow button to receive notifications on newer posts.

4 comments

Leave a Reply