Hive SerDe – RegEx – Example1

SerDe: is short-form for Serializer/Deserializer.

A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. The SerDe interface allows you to instruct Hive as to how a record should be processed. Anyone can write their own SerDe for their own data formats.

Regex stands for a regular expression. The Hadoop Hive regular expression functions define precise character patterns in a given string and are useful for extracting strings from existing data.

In other words, whenever you want to have a kind of pattern matching, based on the pattern matching, you have to store the fields.

RegexSerDe is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.

We have to define our own input pattern and output fields in SerDeProperties, and we will learn how to deal with such input patterns in this post and in future posts on various types of data.

Let us consider the below as an example.
“200$6459″

Our aim is split this data into fields and store in Hive table.

So, let’s start the exercise.

CREATE TABLE sampleTab1(col1 string, col2 string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”([^ ]*)\\$([^ ]*)“,
“output.format.string”=”%1$s %2$s”)
STORED AS TEXTFILE;

Let me explain the input regular expression referenced above. ^ is a symbol to signify that it’s the beginning of the line. The asterisk sign means zero or more characters. Now, [^ ] means that we are instructing Hive to start capturing the character by ignoring the space; this means that only one character is about to be captured. But the asterisk symbol instructs Hive to grab a string of zero or more characters. \$ means that we’re asking Hive to overlook the dollar sign between the strings.

We have created the table with the regular expression, now it’s time to load the data and see if its working.

LOAD DATA LOCAL INPATH ‘Desktop/DollarDelimter.txt’ INTO TABLE sampleTab1;

Please remember, it’s a sample data. In reality, we don’t need to use regular expression if the data is in this format. The data can be split in different ways as mentioned below.

CREATE TABLE Test5(col1 INT, col2 int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘$’;

LOAD DATA LOCAL INPATH ‘Desktop/DollarDelimter.txt’ INTO TABLE Test5;

Another method is using “SPLIT” function:

Let’s create a dummy table to hold the data.
CREATE TABLE sampleTest(data string);
The entire line will be considered as a string here.

Let’s load the data into the table.
LOAD DATA LOCAL INPATH ‘Desktop/DollarDelimter.txt’ INTO TABLE sampleTest;

Now, let us create the table with the column names with proper data types.
CREATE TBALE sampleTest1(col1 INT, col2 INT)

Use the split function to split and cast function to convert the data type to INT.
INSERT INTO sampleTest1
SELECT CAST(split(data, ‘\\$’)[0] AS INT),
CAST(split(data,’\\$’)[1] AS INT) FROM sampleTest;

This is an example for practice.

Let’s do some more exercises on regular expressions. Click here for the next part.