The conversion of delimited data into columns in Apache Impala is discussed in this article. If the data is already loaded as a single column in a table, there is a “SPLIT_PART” function that helps us solve this problem. If the table has not been generated yet you can simply create a table with the FIELDS TERMINATED BY clause in the Apache Hive and load the data directly into the table; the delimited data will be converted to columns automatically. In Apache Impala, the same table can be called for querying purposes, because Impala shares Hive metadata.
Let’s assume that the table is already created and the data is loaded as a single column. Let’s look into it with an example. Consider the below data set for practice.
Data
1456, ConvertedData1, SomeOtherData1
1466, ConvertedData2, SomeOtherData2
1476, ConvertedData3, SomeOtherData4
Let’s create the table and load the data.
— Create table statement
CREATE TABLE splitStringData(dataCol1 STRING);
— Data insertion
INSERT INTO splitStringData VALUES(
(‘1456, ConvertedData1, SomeOtherData1’),
(‘1466, ConvertedData2, SomeOtherData2’),
(‘1476, ConvertedData3, SomeOtherData4’)
Let’s talk about the function – SPLIT_PART:
Description:
The purpose of the function is it returns the nth field within a delimited string. The fields are numbered starting from 1. The delimiter can consist of multiple characters, not just a single character. All matching of the delimiter is done exactly, not using any regular expression patterns. The function returns a string value.
Syntax:
split_part(string source, string delimiter, bigint n)
Implementation:
SELECT split_part(dataCol1,’,’,1),
split_part(dataCol1,’,’,2) AS Col1,
split_part(dataCol1,’,’,3) AS Col2
FROM splitStringData;
Similar implementation in other RDBMS platforms –
5 comments