Using Avro Schema URL in Apache Hive

Please visit my previous article for an introduction and detailed information regarding Apache Avro and its use in Apache Hive. This is a follow-up to “Using Apache Avro in Apache Hive“.

In the last post, we saw how to create a table by including a schema definition in the create table statement. This article will show you how to create a table by mapping the schema location if the schema is in a file.

Please use the same data file mentioned in the previous post, or get it from here.
You can download the “schema” from here.

The following is the information in my “emp.avsc” file which is already placed in HDFS location – “/user/cloudera/avroexample/”.

Let’s begin the exercise:

— Creating the table with ‘SERDEPROPERTIES”
CREATE TABLE empavro2
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.avro.AvroSerDe’
WITH SERDEPROPERTIES (
‘avro.schema.url’=’/user/cloudera/avroexample/emp.avsc’)
STORED as INPUTFORMAT
‘org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat’;

In contrast to previous CREATE TABLE statement, we haven’t included any column or table definition in the above statement. We directed Hive to retrieve the schema from the specified location. AvroSerDe will read and pull the avro schema (avsc) and understands the table properties for the table.

AvroSerDe reads all Avro files within a table against a specified schema. And, understands compressed Avro files and writes any Hive table to Avro files.

LOAD DATA LOCAL INPATH ‘Desktop/Docs/empavro’ into table empavro2;

With the above statement, the data will be loaded into the table. You can cross verify the data now.

avro_schema_specification

In the above example, we used “SERDEPROPERTIES,” but this may also be stated in the alternative manner by providing “TABLEPROPERTIES”. Look the below example.

— Creating the table with ‘TABLEPROPERTIES”
CREATE TABLE empavro2
ROW FORMAT SERDE
‘org.apache.hadoop.hive.serde2.avro.AvroSerDe’
STORED AS INPUTFORMAT
‘org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat’
TBLPROPERTIES (
‘avro.schema.url’=’/user/cloudera/avroexample/emp.avsc’);