Apache Hive – Store & Access Nested Values – Complex Data Types – Part 1

This is the first in a series of articles on the complex data types available in Apache Hive. The series will cover many scenarios, from simple to complex.

Introduction:
Hadoop is designed to manage big data, including complex data that cannot be represented with primitive data types alone. Complex data types are nested data structures built from primitive data types, and they can in turn contain other complex types. Programming languages such as Python, C++ and Java, as well as SQL engines such as Apache Hive and Impala, support complex data types. In Hive, these are ARRAY, MAP, STRUCT and UNIONTYPE.

Let’s take a closer look at the primitive and complex data types.

[Image: ComplexDataTypes]

Now look at the table below:
[Image: emp_table]
The data types for the above table should be as follows:
[Image: EmpSchema]
This schema follows relational database conventions. In Hive, STRING is commonly used in place of VARCHAR, although newer Hive versions also support VARCHAR. This example illustrates the primitive data types.
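As a rough sketch of how a table with only primitive types might be declared in HiveQL: the table name `emp` and all of its columns are hypothetical, chosen for illustration only, and are not taken from the table pictured above.

```sql
-- Hypothetical employee table that uses only primitive types.
-- The table and column names are assumptions, not the article's actual schema.
CREATE TABLE IF NOT EXISTS emp (
  emp_id   INT,     -- whole-number identifier
  emp_name STRING,  -- STRING used where an RDBMS would typically use VARCHAR
  dept     STRING,
  salary   DOUBLE
);
```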

Look at the datasets below to understand the complex data types.

ARRAY:
[Image: Array_Example]

The above data cannot be modeled directly in a traditional relational table, because storing multiple values in a single column violates first normal form. In the first example, the column Teammates holds an array of string values. Similarly, in the second example, the column MarksInAllSubjects holds an array of integer values.

If a column holds a collection of items that all have the same data type, use the ARRAY data type.
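A minimal HiveQL sketch of an ARRAY column might look like the following. The table name `team_members`, its columns, and the chosen delimiters are assumptions for illustration, not the article's actual dataset.

```sql
-- Hypothetical table: each row stores a name and an array of teammate names.
CREATE TABLE IF NOT EXISTS team_members (
  name      STRING,
  teammates ARRAY<STRING>   -- every element must be a STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- separates columns
  COLLECTION ITEMS TERMINATED BY '|'; -- separates array elements

-- Array elements are read by zero-based index; size() returns the element count.
SELECT name,
       teammates[0]    AS first_teammate,
       size(teammates) AS teammate_count
FROM team_members;
```

With these delimiters, each line of the underlying text file would hold the name, a comma, and then the teammates separated by `|`.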

Now look at the dataset below.

MAP:

[Image: MapExample]
The columns in the above examples contain several key-value pairs. In the first case, the subjects (mathematics, science, physics, chemistry) are the keys, and the marks are their values. Similarly, in the second example, the column “Births Per Year” holds key-value pairs: the year is the key and the count of births is its value.

If the data consists of such key-value pairs, consider using the MAP data type.
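The subject-to-marks case could be sketched in HiveQL roughly as follows. The table name `student_marks`, the column names, and the delimiters are hypothetical.

```sql
-- Hypothetical table: each row stores a student and a map of subject -> marks.
CREATE TABLE IF NOT EXISTS student_marks (
  student_name STRING,
  marks        MAP<STRING, INT>   -- key: subject name, value: marks
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- separates columns
  COLLECTION ITEMS TERMINATED BY '|'  -- separates one key-value pair from the next
  MAP KEYS TERMINATED BY ':';         -- separates a key from its value

-- A value is looked up by its key; map_keys() returns all keys as an array.
SELECT student_name,
       marks['mathematics'] AS maths_marks,
       map_keys(marks)      AS subjects
FROM student_marks;
```

Looking up a key that is not present in the map returns NULL rather than raising an error.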

Now, to understand STRUCT, look at the example below.

STRUCT:
[Image: StructExample]
The ‘Address’ column contains several values: house number (an integer), street (a string), building name (a string), and post box number (an integer). If a column holds a fixed collection of named items with different data types, consider using the STRUCT data type.
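The address case could be sketched in HiveQL along these lines. The table name `customer_address` and the field names inside the struct are assumptions made for illustration.

```sql
-- Hypothetical table: each row stores a customer and a structured address.
CREATE TABLE IF NOT EXISTS customer_address (
  customer_name STRING,
  address       STRUCT<house_no:INT, street:STRING, building:STRING, po_box:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- separates columns
  COLLECTION ITEMS TERMINATED BY '|'; -- separates the fields inside the struct

-- Struct fields are read with dot notation.
SELECT customer_name,
       address.street,
       address.po_box
FROM customer_address;
```

Unlike an ARRAY, each field of a STRUCT has its own name and its own type, which is why it suits values such as an address.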

The usage of complex data types will be explored in depth, with several examples, in the upcoming articles.

Stay in touch and do follow to receive notifications.


