Is Complex Data Really Complex To Handle?

Let’s split the title-question into two. What is the complex data and how complex it is to handle.

First, let’s see what complex data is.

Any data that does not fit into relational database structures can be considered as complex data. Maps, system or application created logs, images, videos, streaming data are examples of complex data.

Now, answer to the second question.

Once you start loving the data, once you realize how valuable the data is, once you know how the data can help the company, once you know the functions that can convert the data into a readable format, there is no complexity in handling it.

Data sometimes tricky or messy. Social networking is increasing exponentially and produces a lot of data that can be unstructured and can contain too many different kinds of data from XML to video to SMS. However, translating the information in a proper manner is not an easy process, but once you know which method to use, there is little challenge at all.

The above data should be converted into the following format.

CustomerInfo
Name < First Name, Last Name>
Address<HouseNo, LocalityName, City, Zip>
Phone <Mobile,Landline>

If you look at the data, you’ll immediately think about the string or regex functions to convert the data into a readable table format. However, this cannot be done by the usual string function and cannot be handled by regular expressions, not at least in a single shot. However, Apache Hive has so many functions in its store that the SPLIT function is one of them.

The Split function separates text on the basis of the given delimiter and places each fragment in a separate cell in a row.

Let’s see how it can be done.

Creating Hive table:

Use dbTest;

CREATE TABLE IF NOT EXISTS CustomerInfo(
Name STRUCT <First_Name:string, Last_Name:string>,
Address STRUCT<HouseNo:string, Locality:string, City:string, Zip:string>,
Phone STRUCT<Mobile:string, Landline:string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘I’
COLLECTION ITEMS TERMINATED BY ‘,’;

Assuming that the data is stored in a single column in a temp table within the Hive metastore.

Conversion:
INSERT OVERWRITE TABLE CustomerInfo
SELECT named_struct(“First_Name”, split(data, ‘\\|’)[0],
“Last_Name” , split(data, ‘\\|’)[1] ),
named_struct (“HouseNo”,split(split(data, ‘\\|’)[2],”,”)[0],
“Locality”,split(split(data, ‘\\|’)[2],”,”)[1],
“City” , split(split(data, ‘\\|’)[2],”,”)[2] ,
“Zip”,split(data, ‘\\|’)[5]),
named_struct( “Mobile”,split(data, ‘\\|’)[3],
“Landline”,split(data, ‘\\|’)[4] )
FROM tempCustInfo;