Using Apache Avro in Apache Hive

Avro is a serialization system developed by Apache. It is a row-based storage format.

Avro contains the data definition as well as the data in the same message or file. The data definition is stored in JSON format, making it easy to read and analyze; the data itself is stored in binary format, making it compact and efficient.

Avro supports rich data structures, a compact binary encoding, and a container file for Avro data sequences (often referred to as Avro data files). Avro is language-independent, with language bindings available for Java, C, C++, Python, and Ruby.

The introduction above follows the Avro documentation. Compared with Parquet, Avro generally performs better on write operations. It is also a more compact data format than JSON, because the records are binary-encoded against a schema. Like ORC and Parquet, however, Avro data is stored in a binary format that is machine-readable rather than human-readable.

In this post, we will see how to create a table with Avro schema and load Avro data into the table.

Based on the EMP table from Oracle SQL*Plus, an Avro data file and schema have already been generated and are available for download.

Avro data file sample:
[Screenshot: sample of the Avro data file]

Let’s begin the exercise:

-- Creating the table with Avro schema

CREATE TABLE empavro
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
        "type" : "record",
        "namespace" : "BigDatanSQL",
        "name" : "Employees",
        "fields" : [
                { "name" : "empno" , "type" : "string" },
                { "name" : "ename" , "type" : "string" },
                { "name" : "job" , "type" : "string" },
                { "name" : "mgr" , "type" : "string" },
                { "name" : "hiredate" , "type" : "string" },
                { "name" : "sal" , "type" : "string" },
                { "name" : "comm" , "type" : "string" },
                { "name" : "deptno" , "type" : "string" }
        ]
}');
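
On Hive 0.14 and later, the same kind of table can usually be declared more briefly with STORED AS AVRO, and the schema can be read from a file through the avro.schema.url property instead of being written inline. The sketch below is only an illustration: the table name empavro_url and the HDFS path of the schema file are hypothetical and are not part of this exercise.

```sql
-- Sketch only: empavro_url and the schema path are hypothetical names.
-- Assumes Hive 0.14+ and that the schema JSON used for empavro has been
-- saved as a file at the HDFS location given in avro.schema.url.
CREATE TABLE empavro_url
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/schemas/employees.avsc');
```

Keeping the schema in a file can make it easier to share one definition across several tables, while the inline literal keeps everything in a single statement.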

Now, load the data into the table.

LOAD DATA LOCAL INPATH 'Desktop/Docs/empavro' INTO TABLE empavro;

[Screenshot: creating the Avro table with its schema]

Let’s query and see if the data is inserted correctly.

[Screenshot: query output verifying the data in the Avro table]
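
A few queries that could serve as this check are sketched below. They assume the empavro table created earlier in this post. Because every field in the schema is declared as a string, the salary column has to be cast before any arithmetic is done on it.

```sql
-- Confirm that the columns were derived from the Avro schema.
DESCRIBE empavro;

-- Look at a few rows.
SELECT * FROM empavro LIMIT 5;

-- The row count should match the number of records in the source file.
SELECT COUNT(*) FROM empavro;

-- sal is a string in the schema, so cast it before aggregating.
SELECT deptno, AVG(CAST(sal AS DOUBLE)) AS avg_sal
FROM empavro
GROUP BY deptno;
```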

Hope you find this article helpful.

Please do subscribe for more interesting updates.
