Hive SerDe – RegEx – Example5

This is a continuity part of the series “Hive SerDe Regular Expressions”.

As we discussed earlier, the Hadoop Hive regular expression functions define precise patterns of characters in the given string and are useful for extracting strings from the data and validating current data, e.g. validating the year, verifying the range, checking characters and extracting unique characters from the data.

In the previous posts, we have worked with simple to semi-complex datasets. Let’s consider some ‘logs’ data this time.

Below is the sample data:
Data_AccessLog
This is a popular dataset that roaming around so many forums.
Click here for the data to download.

Let’s see in how many patterns, we can convert this data into structured format.

Pattern-1:
CREATE TABLE Access_log1(
host string,
identity string,
user1 string,
time1 string,
request string,
status string,
size string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=“([^ ]*) ([^ ]*) ([^ ]*) (?:-|\\[([^\\]]*)\\]) \\\”([^\\\”]*|\\\”[^\\\”]*)\\\” ([^ ]*) ([^ ]*)”,
“output.format.string”=”%1$s %2$s %3$s %4$s %5$s %6$s %7$s”
)
STORED AS TEXTFILE;

Pattern1

Explanation:

[^ ] captures anything other than space
* zero or more number of occurrences
?: means non-capturing group
– matches a “-” character
| alteration – acts like a boolean OR
\[ Escaped character – Matches a “[” character
( – is a capturing group
[^ negated set – match any character that is not in the set.
\] Escaped character – Matches a “]” character
* Match 0 or more of the preceding token.
\] Matches a “]” character.

In other words:
([^ ]*) = Captures the value “64.242.88.10”.
space = There is a space in between two strings capture. This means not to capture the value ( space)
([^ ]*) = captures “-”
space = It is outside the brackets hence will be ignored.
([^ ]*) = captures “-” because there is no other value.
(?:-|\\[([^\\]]*)\\]) = Captures “07/Jun/2016:16:47:46 -0800” by eliminating “[” and “]”
\\\”([^\\\”]*|\\\”[^\\\”]*)\\\” = Captures “GET /hadoopexam.com/bin/rdiff/Know/ReadmeFirst?rev1=1.5?rev2=1.4 HTTP/1.1” by eliminating leading and trailing double quotes (“).
space = space outside of the bracket will be ignored
([^ ]*) = captures 200
space = ignored
([^ ]*) = remaining string i.e. 5724

Pattern-2:
CREATE TABLE Accesslog(
host string,
identity string,
user1 string,
time1 string,
request string,
status string,
size string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=“^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+.(\\S+\\s+\\S+\\s+\\S+.\\S+).\\s+(\\S+)\\s+(\\S+)$”,
“output.format.string”=”%1$s %2$s %3$s %4$s %5$s %6$s %7$s”
)
STORED AS TEXTFILE;

Pattern2

I feel this is more convenient or easy way to learn. If you know when to use “S” and “s”, you will start loving regular expressions. “S” for string and “s” for space.

^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+.(\\S+\\s+\\S+\\s+\\S+.\\S+).\\s+(\\S+)\\s+(\\S+)$

Explanation: 

^ is “start of the line”
$ is “end of the line”
(\\S+) means capture the string. Here the value will be “64.242.88.10”
\\s+ means ignore the space
(\\S+) will capture “-”
\\s+ ignores the space
(\\S+) will capture “-”
\\s+ ignores the space
. (period) will ignore any special character since it is outside the capturing group i.e. outside the brackets. Here the special character that gets ignored is “[”
(\\S+\\s+\\S+) = this captures 07/Jun/2016:16:47:46 -0800
\\S captured “07/Jun/2016:16:47:46” and \\s is included (because it is inside of the capture group) and again \\S captured the value/string “-0800”
. (period) again, to ignore special character i.e. “]”

If you are wondering what it is returning for ‘request’ column specifically in both the tables, here you go.

Regex_onlyRequestColumnFromTable

There is no difference in both the patterns.

Stay tuned for more interesting updates.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s