Hive SerDe – RegEx – Example4

“Build step by step. Push yourself, but not too hard. Learn, keep it fun.”

It is definitely harder to read and comprehend 100 characters in a regular expression syntax. But if there’s a proper learning pattern, it’s not impossible to master. Below dataset is an example that we are increasing the complexity step by step. The upcoming articles will have more complex data.

Dataset-3
06-24-2020 HostRPT Check Services Srv1 OK;Srv2 OK;ProcS OK; Checked State:5720
06-24-2020 HostSTG Check Load Cmpt OK; Usage;36%;Load Average:54.61, 56.95

It is a tab delimited data however there are multiple other delimiters (space, semicolon, colon, etc.) exist in the third and fourth fields.

Let’s see in how many patterns we convert this data into Hive’s structured format.

Pattern-1:
CREATE TABLE sys_results(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “^([^\t]*)\t+([^\t]*)\t+([^\t]*)\t+(.*)”,
“output.format.string” = “%1$s %2$s %3$s %4$s”
)
STORED AS TEXTFILE;

–Load the data in the table
LOAD DATA LOCAL INPATH ‘Desktop/Serde_sysResults.txt’ INTO TABLE sys_results;

Explanation:
> ^ matches the beginning of a line
> ([^\t]*) is a capturing group for extracting a character except [^\t] tab. Asterisk (*) means zero or more characters. Thus it makes capturing a string.
> \t+ means escape a tab.
> . (dot) matches any character except line breaks.

So, the first capturing group ([^\t]*) captures 06-24-2020
> \t+ escapes a tab delimiter
> second capturing group ([^\t]*) captures “HostRPT”
> \t+ escapes a tab delimiter
> third capturing group ([^\t]) captures “Check Services” since we are not escaping space.
> \t+ escapes a tab delimiter
> fourth capturing group (.*) captures everything before the line break.

Pattern-2:
CREATE TABLE sys_results(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “^(\\S+)\\t+(\\S+)\\t+(\\S+\\s+\\S+)\\t+(.+)”,
“output.format.string” = “%1$s %2$s %3$s %4$s”
)
STORED AS TEXTFILE;

–Load the data in the table.
LOAD DATA LOCAL INPATH ‘Desktop/Serde_sysResults.txt’ INTO TABLE sys_results;

Explanation:
> ^ is for beginning of the line.
> (\\S+) captures a string except a space.
> \\t+ escapes a tab.
> (\\S+\\s+\\S+) is to match the string “Check Services”; It is not a single string but it’s got two words. The first \\S captures “Check” and the last \\S captures “Services” since the space identifier is inside the brackets it behaves like a concatenation. In simple words, String + space + String.
> (.+) is to capture anything except line breaks.

Again, the above-mentioned dataset is not a complex data for which regular expression is needed. It can easily be achieved using the code below.

Pattern-3
CREATE TABLE sysData(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’;

LOAD DATA LOCAL INPATH ‘Desktop/Serde_sysResults.txt’ INTO TABLE sysData;

Let’s consider more complex datasets in the upcoming articles.