Hive SerDe – RegEx – Example3

Regular expressions are difficult to understand, but not in contrast to the data they contain. You will begin to love it once you understand the structure of the regular expressions.

Often we use regular expressions for the complex data which is not in a pattern that can be handled by available functions.

Let’s consider the below datasets for practice.

Dataset-1
06-24-2020 HostRPT CheckServices
06-24-2020 HostSTG CheckLoad

Dataset-2
06-24-2020 HostRPT Check Services
06-24-2020 HostSTG Check Load

If you look at the datasets in the previous posts and in this article, for practice, we simply complicate it step by step.

Dataset-1 and Dataset-2 are similar, and the last column value separated by a space is the only distinction between the two.

Let’s start the exercises.

For the Dataset-1:
CREATE TABLE SysResults_TAB(
date STRING,
hostname STRING,
command STRING
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “^(\\S+)\\t+(\\S+)\\t+(\\S+)$”,
“output.format.string” = “%1$s %2$s %3$s”
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH ‘Desktop/SysResults_TAB.txt’ INTO TABLE SysResults_TAB;


As discussed earlier, the ^ signifies beginning of the string and $ is to match the ending of the input string. The \t which is outside the brackets matches a tab. \S+ (uppercase S) matches anything that is NOT matched by \s, i.e., non-whitespace. In regex, the uppercase metacharacter denotes the inverse of the lowercase counterpart.

In simple words,
(\\S) fetches “06-24-2020”
\\t ignores ‘tab’ since it’s outside of the brackets.
(\\S) captures “HostRT”
\\t ignores ‘tab’
(\\S) captures “CheckServices”

Let’s go for the Dataset-2:

CREATE TABLE SysResults_TAB(
date STRING,
hostname STRING,
command STRING
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “^(\\S+)\\t+(\\S+)\\t+(\\S+\\s+\\S+)$”,
“output.format.string” = “%1$s %2$s %3$s”
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH ‘Desktop/SysResults_TAB.txt’ INTO TABLE SysResults_TAB;




The explanation is almost same except for the last match. (\\S+\\s+\\S+) is to match the string “Check Services”; It is not a single string but it’s got two words. The first \\S captures “Check” and the last \\S captures “Services” since the space identifier is inside the brackets it behaves like a concatenation. In simple words, String + space + String.

Please note that, the two above-mentioned datasets are not complex data for which regular expressions are needed. It can easily be achieved using the code below.

CREATE TABLE SysResults_TAB(
date STRING,
hostname STRING,
command STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’;


Let’s consider more complex datasets in the upcoming articles.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s