Complex Data Types – ARRAY within MAP

Please click here for “ARRAY in MAP”

The data is available in a number of formats, volumes, and types. It may be well-organized, partially-organized, or completely disorganized. Numbers or text may be used. It could be in a relational format, or it could be too large for row-column or tabular databases. Finally, data aids in the understanding and improvement of business and business processes, allowing you to save resources and time.

We’ll discuss “complex data” which is a topic that comes up often in current scenarios. As previously stated, data is derived from a variety of sources and in a variety of formats, one of which is nested values. Not everything fits in rows and columns. Not every data bound to Codd’s rules. Sometimes, if we try to put everything in rows and columns it can cause redundancy too. For example, look at the below data.

StatePop_ArrayInStruct_Data_Sample

The above information pertains to the most famous towns and cities in each Indian state. When you look at the data, you’ll see that the “State” detail is replicated every time. This supposed to be in the following format to avoid redundancy.

StatePop_ArrayInStruct_Data_Sample2
The above data is more difficult to process and interpret than simple data, and we can’t store it with traditional systems’ primitive data types. There is no other option but to store the data as a string in a single cell if we depend on RDBMS. If we save the array values in primitive data types as a single cell, it would be difficult to retrieve individual items from the array values. This is where “complex data types” are required.

So, let’s continue learning about complex data types in this post.

I obtained the below data of famous cities and towns in Indian states, as well as their population. This information is obtained from wiki. We’ll use this data to see how to handle an Array when it’s contained within a MAP complex data type. Also, we’ll see how to summarize, process and group the data while working with nested complex data types.

StatePop_ArrayInStruct_Data

Let’s begin the exercise.

Since the data is in rows and columns, let’s store it in a normal table.

–Creating the table with primitive data types to store the source data which is in rows and columns.
CREATE TABLE StatePop(City String, State String, Pop BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’;

–Loading the data into the table
LOAD DATA LOCAL INPATH ‘Desktop/Docs/IndianStatesPop.csv’ INTO TABLE StatePop;

StatePop_ArrayInStruct

–Now, create a table with complex data types.
CREATE TABLE StatePop_Complex(
Pop BIGINT,
StateCity STRUCT <State: STRING, City: ARRAY <STRING> >);

–Inserting the values from the source table to newly created table.
INSERT INTO StatePop_Complex
SELECT SUM(pop),
named_struct(‘State’, State, ‘City’, collect_list(City))
FROM StatePop
GROUP BY State

–The data has been populated. Let’s verify it.
SELECT StateCity.State, StateCity.City FROM StatePop_Complex;

StatePop_ArrayInStruct2
StatePop_ArrayInStruct3