This article is part of a series on the complex data types available in Apache Hive. The series will work through many scenarios, from simple to complex.
Introduction:
Hadoop is designed to manage big data, including complex data that cannot be represented using primitive data types alone. Complex data types are nested data structures made up of primitive data types, or of other complex types. Most programming languages, such as Python, C++ and Java, as well as SQL engines such as Apache Hive and Impala, support complex data types such as Array, Struct and Map. Hive also provides a Union type (UNIONTYPE).
Let’s look more closely at primitive and complex data types.
Now look at the table below:
The data types for the columns of the above table should be as follows:
These are the types a relational database would use. In Hive, STRING is commonly used in place of VARCHAR (Hive also has a VARCHAR type, but STRING is the usual choice). This example illustrates the primitive data types.
Look at the datasets below to understand complex data types.
ARRAY:
The above data cannot be stored as-is in a traditional relational system, since a column holding multiple values goes against its basic rule that each column holds a single value. In the first example, the Teammates column holds an array of string values. Similarly, in the second example, the MarksInAllSubjects column holds an array of integer values.
If a column holds a collection of items of the same data type, use the ARRAY data type.
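As a rough analogy, an ARRAY column value behaves much like a typed list in a general-purpose language. The Python sketch below models the two columns mentioned above; the sample values are made up for illustration, and the Hive type names in the comments show how such columns would typically be declared.

```python
from typing import List

# Hypothetical sample values; they do not come from the original dataset.
teammates: List[str] = ["Asha", "Ravi", "Meena"]   # Hive type: ARRAY<STRING>
marks_in_all_subjects: List[int] = [78, 85, 91]    # Hive type: ARRAY<INT>

# Hive array indexes are zero-based, just like Python lists,
# so Teammates[0] in Hive corresponds to teammates[0] here.
first_teammate = teammates[0]

# Hive's size() function plays the role of len().
number_of_marks = len(marks_in_all_subjects)

print(first_teammate, number_of_marks)
```

Every element of an array shares one type, which is why the string values and the integer values sit in separate columns.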
Now look at the dataset below.
MAP:
The columns in the above examples contain several key-value pairs. In the first case, the subjects (mathematics, science, physics and chemistry) are the keys, and the marks are their values. Similarly, in the second example, the “Births Per Year” column holds key-value pairs: the year is the key and the count of births is its value.
If the data consists of such key-value pairs, consider using the MAP data type.
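A MAP column value can be pictured as a dictionary. The Python sketch below models the two examples described above, using made-up numbers; the Hive type names in the comments are the declarations such columns would typically have.

```python
from typing import Dict, List

# Hypothetical sample values; they do not come from the original dataset.
marks_by_subject: Dict[str, int] = {   # Hive type: MAP<STRING,INT>
    "mathematics": 90,
    "science": 82,
    "physics": 76,
    "chemistry": 88,
}
births_per_year: Dict[int, int] = {    # Hive type: MAP<INT,INT>
    2019: 1200,
    2020: 1150,
}

# Values are looked up by key, much like marks['physics'] in Hive.
physics_marks = marks_by_subject["physics"]

# Hive's map_keys() returns the keys as an array; sorting here only
# makes the result deterministic for this sketch.
subjects: List[str] = sorted(marks_by_subject.keys())

print(physics_marks, subjects)
```

All keys share one type and all values share another, but the key type and the value type may differ from each other.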
Now, to understand STRUCT, look at the example below.
STRUCT:
The ‘Address’ column contains several values: house number (an integer), street (a string), building name (a string), and post box number (an integer). If a column holds a fixed collection of items with different data types, consider using the STRUCT data type.
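A STRUCT column value resembles a record with named, individually typed fields. The Python sketch below models the Address column as a dataclass; the field names and sample values are assumptions chosen for illustration, and the comment shows one way the column might be declared in Hive.

```python
from dataclasses import dataclass


# Possible Hive declaration (field names are hypothetical):
# STRUCT<house_number:INT, street:STRING, building_name:STRING, post_box_number:INT>
@dataclass
class Address:
    house_number: int
    street: str
    building_name: str
    post_box_number: int


# Hypothetical sample value; it does not come from the original dataset.
address = Address(
    house_number=12,
    street="Lake Road",
    building_name="Rose Villa",
    post_box_number=4501,
)

# Fields are read with dot notation, as in Hive (Address.street).
street = address.street

print(street, address.post_box_number)
```

Unlike an array, a struct has a fixed set of fields, and each field can have its own type.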
The usage of complex data types will be explored in depth, with several examples, in the upcoming articles.
Stay in touch and do follow to receive notifications.