Complex Data Types – ARRAY within STRUCT

This is one of the articles that required a significant amount of time and effort to complete. Once an idea was formed, I went to wiki and found some relevant data to combine in a way that justified the topic title. 

We learned how to embed a complex data type STRUCT into an ARRAY datatype in previous articles. Let’s explore what happens if we put the same complex data types in reverse order, which is what happens if we put an array inside the struct.

As discussed in the previous articles, any complex data type can be the top level type for a column or can itself be an item within another complex type. In this article, we are going to see how a STRUCT can be the top-level type for a column, and also it is an item within an ARRAY.

Take a look at the data below. I collected individual celebrity information from wiki, but I may have made a mistake in doing so, so please do not depend on the information and instead look at the data format. Click on the image to enlarge.

ArrayInStructInArray_Table

The table structure should be as follows, based on the data:

Field Name Data Type
ID INT
CelebName STRING
Marriages ARRAY<STRUCT<
       SPOUSE:STRING,
       Children:ARRAY<STRING>
  >>

Below is the format of the input file.

ArrayInStructInArray_CSVData

So we know our data and we know what could be the table structure. Let’s proceed to implement it in Apache Hive.

–Creating the table in Hive
CREATE TABLE CelebMarriages
(
id BIGINT,
name STRING,
marriages ARRAY < STRUCT <
                                               spouse: STRING,
                                               children: ARRAY <STRING>
                                  > >
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
COLLECTION ITEMS TERMINATED BY ‘,’
MAP KEYS TERMINATED BY ‘:’
LINES TERMINATED BY ‘\n’;

–Importing the data into the table
LOAD DATA LOCAL INPATH ‘Desktop/Docs/test1.csv’ INTO TABLE CelebMarriages; 

–Queries / Data retrieval
SELECT name, marriages.spouse[0], marriages.children[0] FROM CelebMarriages;
SELECT name, marriages.spouse[1], marriages.children[1] FROM CelebMarriages;
SELECT name, marriages.spouse[2], marriages.children[2] FROM CelebMarriages;

ArrayInStructInArray_Output

Hope you find this article helpful.

Please subscribe for more interesting updates.

4 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s