Similar to the previous article, the dataset below is also common and a good example to learn regular expressions.
Dataset (for copying purpose):
82.133.98.11 – – [15/May/2019:07:47:12 -0900] “GET /org.apache.com/bin/param/Vars/ReadmeFirst?view1=1.5?view2=1.4 HTTP/1.1” 299 3939
82.133.98.11 – – [15/May/2019:07:55:17 -0900] “GET /org.apache.com/bin/view/Page/org.apache.comGroups?view=1.2 HTTP/1.1” 299 4949
82.133.98.11 – – [15/May/2019:08:11:19 -0900] “GET /org.apache.com/bin/param/Page/ConfigurationVariables HTTP/1.1” 299 56789
82.133.98.11 – – [15/May/2019:08:22:11 -0900] “GET /org.apache.com/bin/edit/Page/Flush_service_name?topicparent=Page.ConfigurationVariables HTTP/1.1” 401 12345
82.133.98.11 – – [15/May/2019:09:33:20 -0900] “GET /org.apache.com/bin/param/org.apache.com/org.apache.comRegistration HTTP/1.1” 299 34567
82.133.98.11 – – [15/May/2019:09:44:41 -0900] “GET /org.apache.com/bin/param/Page/NicholasLee HTTP/1.1” 299 7890
82.133.98.11 – – [15/May/2019:10:35:20 -0900] “GET /org.apache.com/bin/view/Sandbox/WebHome?view=1.6 HTTP/1.1” 299 1234
82.133.98.11 org.apache_user host1 [15/May/2019:10:18:22 -0900] “GET /org.apache.com/listinfo/administration HTTP/1.1” 299 5454
Let’s see how many ways / approaches to convert this data into structured format.
Method-1
CREATE TABLE Accesslog(
ipaddr string,
user string,
host string,
datetime1 string,
request string,
status string,
size string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+.(\\S+\\s+\\S+\\s+\\S+.\\S+).\\s+(\\S+)\\s+(\\S+)$”,
“output.format.string”=”%1$s %2$s %3$s %4$s %5$s %6$s %7$s”
)
STORED AS TEXTFILE;
Now, load the data into the table.
Explanation:
“S” represents string
“s” represents space
“.” (period) represents any special character
Inside the brackets – Capturing group
Outside the brackets – Non-capturing group
^ represents start of the line
$ represents end of the line
Method-2
CREATE TABLE Access_log1(
ipaddr string,
user string,
host string,
datetime1 string,
request string,
status string,
size string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”([^ ]*) ([^ ]*) ([^ ]*) (?:-|\\[([^\\]]*)\\]) \\\”([^\\\”]*|\\\”[^\\\”]*)\\\” ([^ ]*) ([^ ]*)”,
“output.format.string”=”%1$s %2$s %3$s %4$s %5$s %6$s %7$s”
)
STORED AS TEXTFILE;
Now, load the data into the table.
Explanation:
82.133.98.11 org.apache_user host1 [15/May/2019:10:18:22 -0900] “GET /org.apache.com/listinfo/administration HTTP/1.1” 299 5454
([^ ]*) ([^ ]*) ([^ ]*) (?:-|\\[([^\\]]*)\\]) \\\”([^\\\”]*|\\\”[^\\\”]*)\\\” ([^ ]*) ([^ ]*)
There are seven capturing groups in the regex.
1st Capturing Group ([^ ]*)
will try to match any single character except space, The asterisk (*) quantifier makes it more than one character since it matches between zero and unlimited times, as many times as possible. Thus it makes a string and fetches the value “82.133.98.11”.
There’s a space between 1st and 2nd capturing groups which means space need to be ignored.
2nd Capturing Group ([^ ]*)
Similar to the previous, this group matches any single character except space and the asterisk quantifier makes it a string (multiple characters). It fetches the value “org.apache.user”
3rd Capturing Group ([^ ]*)
Similar to the previous, this group also matches a string value and fetches “host1” from the line.
?: groups multiple tokens together without creating a capture group.
4th Capturing Group -|\\[([^\\]]*)\\])
This group has two alternatives. The 1st alternative is it tries to match the character “-“, if not found (pipe symbol signifies “OR”).
2nd Alternative \\[([^\\]]*)\\]) will try to match the character ‘[‘ and eliminates it from capturing. Also ignores the character ‘]’. Whatever is available in between will be captured. Thus it captures “15/May/2019:07:47:12 -0900”
5th Capturing Group \\\”([^\\\”]*|\\\”[^\\\”]*)\\\”
This group has also two alternatives since we used “|” (pipe) symbol. The first alternative \\\”[^ \\\”]* will capture any number of characters except double quotes. The second alternative does the same job too. It also ignores the double quote and captures the string value. Finally, it fetches the value – GET /org.apache.com/bin/param/Vars/ReadmeFirst?view1=1.5?view2=1.4 HTTP/1.1
6th Capturing Group ([^ ]*)
Will try to match any single character except space, The asterisk (*) quantifier makes it more than one character since it matches between zero and unlimited times, as many times as possible. Thus it makes a string and fetches the value “299”.
7th Capturing Group ([^ ]*)
Similar to the previous group it captures the value “3939”
Results:
Other column values from the table.
One comment