Apache Hive – Avro Sample Schema & Data

When learning any technology, it helps to have sample datasets to practice on; without hands-on practice, the learning stays incomplete. I want to help beginners and learners by providing sample datasets for practice.

This post includes an example Avro schema and matching Avro data. More samples will follow in subsequent articles.

You can download the schema here.
You can download the data here.

Alternatively, copy the Avro schema below into a plain-text editor such as Notepad and save it as emp.avsc:

{
  "namespace": "testing.hive.avro.serde",
  "name": "emp",
  "type": "record",
  "fields": [
    {
      "name": "empno",
      "type": "int",
      "doc": "employee unique identity"
    },
    {
      "name": "ename",
      "type": "string",
      "doc": "employee full name"
    },
    {
      "name": "job",
      "type": "string",
      "doc": "designation"
    },
    {
      "name": "mgr",
      "type": "int",
      "doc": "reporting to"
    },
    {
      "name": "hiredate",
      "type": "string",
      "doc": "joined date"
    },
    {
      "name": "sal",
      "type": "double",
      "doc": "salary of the employee"
    },
    {
      "name": "comm",
      "type": "double",
      "doc": "Commission of the employee"
    },
    {
      "name": "deptno",
      "type": "int",
      "doc": "dept employee belongs to"
    }
  ]
}
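Copying JSON from a web page or word processor can silently replace straight quotes with typographic ones, which makes the schema file invalid. As a quick sanity check before uploading emp.avsc, a minimal Python sketch like the one below parses the schema text and lists each field with its type. The function name summarize_record_schema and the shortened two-field sample are illustrative assumptions, not part of the original post; the check covers only JSON validity and the record shape, not full Avro validation.

```python
import json


def summarize_record_schema(text):
    """Parse Avro schema text and return {field name: field type}.

    Raises ValueError if the text is not valid JSON (for example because
    it contains typographic quotes) or is not a record schema.
    """
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"schema is not valid JSON: {exc}") from exc
    if not isinstance(schema, dict) or schema.get("type") != "record":
        raise ValueError("top-level schema must be a record")
    # Collect the declared fields in the order they appear in the schema.
    return {field["name"]: field["type"] for field in schema.get("fields", [])}


# Shortened copy of the emp schema, used here only for illustration.
# To check the real file, read it instead, e.g.:
#   text = open("emp.avsc", encoding="utf-8").read()
sample = """
{
  "namespace": "testing.hive.avro.serde",
  "name": "emp",
  "type": "record",
  "fields": [
    {"name": "empno", "type": "int", "doc": "employee unique identity"},
    {"name": "ename", "type": "string", "doc": "employee full name"}
  ]
}
"""

print(summarize_record_schema(sample))  # {'empno': 'int', 'ename': 'string'}
```

If the file still contains typographic quotes, the function raises a ValueError that points at the offending position, which is usually easier to read than the error Hive reports when it cannot load the schema.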

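After copying the schema and data files to your HDFS location, you can create an external table over the data in a single statement, as illustrated below. Because the table is external, Hive reads the Avro files in place from the LOCATION directory, so no separate load step is needed.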

CREATE EXTERNAL TABLE emp_avro
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='/user/cloudera/empavro/schema/emp.avsc')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/cloudera/empavro/data';


Hope you find this article helpful.

Please follow for more interesting updates.