Complex Data Types – STRUCT

STRUCT in Hive is similar to STRUCT in the C programming language. It’s a record type that contains a collection of named fields that can be of any primitive data type. Using the DOT (.) notation, we can access the elements of the STRUCT type.

In simple words, STRUCT, in contrast to “Array,” is a collection of different data.

Let’s consider the below dataset.

Students.csv
First Name|Last Name|Address|Mobile|ZipCode
Paula|Foote|3204, Stratford Court, Clinton, NC|919-726-5752|011-12345|28328
Eugene|Diller|2324, Smith Road, Douglasville, GA|770-942-1596|011-12346|30134
Caroline|Carter|4676 Lucy Lane, Montgomery, IN|812-486-6069|011-12347|47558
Lesli|Maker|1479 Red Maple Drive, Irvine, CA|323-474-0535|011-12348|92614

Please accomplish the following activities:

  • Load this CSV file in HDFS.
  • Create a Hive table in the following format
    • Table Name: StudentsInfo
      • Name < First_Name, Last_Name>
      • Address< HouseNo, LocalityName, City, State, Zip>
      • Phone < Mobile,Landline>
  • Make sure the Hive table store data in a Sequence File format.
  • Location of data should be /user/hive/warehouse/students/

Solution:
Let’s create the table as per the requirement.

CREATE TABLE IF NOT EXISTS StudentsInfo(
Name STRUCT < First_Name:string, Last_Name:string>,
Address STRUCT< HouseNo: string, LocalityName: string, City: string, State: string, Zip: string>,
Phone STRUCT<Mobile: string, Landline: string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘I’
COLLECTION ITEMS TERMINATED BY ‘,’;

If you look at the data, the column position is different. We cannot save the data into the table directly. Hence let’s load the data into a temporary table.

CREATE TABLE tempStudentsInfo (data STRING) STORED AS TEXTFILE;
LOAD DATA INPATH cloudera/bigdatansql/studentsinfo.csv INTO TABLE tempStudentsInfo;

Now, if you see, all the data is loaded into a single column. Let’s use the SPLIT function which splits the data based on the delimiter provided and it is mostly used function in Apache Hive.

SELECT
split(data,’\\|’)[0] First_Name ,
split(data,’\\|’)[1] Last_Name ,
split(split(data, ‘\\|’)[2],”,”)[0] HouseNo ,
split(split(data, ‘\\|’)[2],”,”)[1] LocalityName ,
split(split(data, ‘\\|’)[2],”,”)[2] City ,
split(split(data, ‘\\|’)[2],”,”)[3] State ,
split(data, ‘\\|’)[3] Mobile ,
split(data, ‘\\|’)[4] Phone ,
split(data, ‘\\|’)[5] Zip
FROM tempStudentsInfo;

The result of the above query displays the data into different column values.
Now, use the same query to Insert the data from the temp table to the actual table.

INSERT OVERWRITE TABLE StudentsInfo
SELECT named_struct(“First_Name”, split(data, ‘\\|’)[0],

“Last_Name” , split(data, ‘\\|’)[1] ),
named_struct (“HouseNo”,split(split(data, ‘\\|’)[2],”,”)[0],
“LocalityName”,split(split(data, ‘\\|’)[2],”,”)[1],
“City” , split(split(data, ‘\\|’)[2],”,”)[2] ,
“State” , split(split(data, ‘\\|’)[2],”,”)[3] ,
“Zip”,split(data, ‘\\|’)[5]),
named struct ( “Mobile”,split(data, ‘\\|’)[3],
“Landline”,split(data, ‘\\|’)[4])
FROM tempStudentsInfo;

Hope you find this article helpful.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s