The dataset below is one of the most popular and active dataset on the Internet. Regular expression has been written for this log by a few developers. However, our goal is to learn the regex patterns and see how many methods we can use to translate this log to a structured format.
66.249.68.6 – – [14/Jan/2012:06:25:03 -0800] “GET /example.com HTTP/1.1” 200 708 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
66.249.68.5 User1 Host1 [14/Jan/2012:06:25:03 -0900] “GET /example.com HTTP/1.1” 300 908 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
(Click on the image to enlarge)
Let’s see the first pattern.
Method-1
CREATE TABLE apache_combined_log (
ipaddr STRING,
user STRING,
host STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \”]*|\”[^\”]*\”) (-|[0-9]*) (-|[0-9]*)(?: ([^ \”]*|\”[^\”]*\”) ([^ \”]*|\”[^\”]*\”))?”,
“output.format.string” = “%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s”
)
STORED AS TEXTFILE;
Let’s load the data now.
LOAD DATA LOCAL INPATH ‘Desktop/SerDe_ApacheLog.txt’ INTO TABLE apache_combined_log;
The data will be loaded without any errors. Let’s randomly check few columns from the table.
SELECT ipaddr, user, host FROM apache_combined_log;
SELECT time, agent, referer, size FROM apache_combined_log;
Explanation:
There are 9 capturing groups in the regex input pattern.
66.249.68.5 User1 Host1 [14/Jan/2012:06:25:03 -0900] “GET /example.com HTTP/1.1” 300 908 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
“([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \”]*|\”[^\”]*\”) (-|[0-9]*) (-|[0-9]*)(?: ([^ \”]*|\”[^\”]*\”) ([^ \”]*|\”[^\”]*\”))?”
1st Capturing Group ([^ ]*)
will try to match any single character except space, The asterisk (*) quantifier makes it more than one character since it matches between zero and unlimited times, as many times as possible. Thus it makes a string and fetches the value “66.249.68.5”.
There’s a space between 1st and 2nd capturing groups which means space need to be ignored.
2nd Capturing Group ([^ ]*)
Similar to the previous, this group matches any single character and the asterisk quantifier makes it a string (multiple characters). It fetches the value “User1”
3rd Capturing Group ([^ ]*)
Similar to the previous, this group also matches a string value and fetches “Host1” from the line.
4th Capturing Group (-|\\[[^\\]]*\\])
This group has two alternatives. The 1st alternative is it tries to match the character “-“, if not found (pipe symbol signifies “OR”).
2nd Alternative \\[[^\\]]*\\] will try to match the character ‘[‘ and eliminates it from capturing. Also ignores the character ‘]’. Whatever is available in between will be captured. Thus it captures “14/Jan/2012:06:25:03 -0900”
5th Capturing Group ([^ \”]*|\”[^\”]*\”)
This group has also two alternatives since we used “|” (pipe) symbol. The first alternative [^ \”]* will capture any number of characters except double quotes. The second alternative does the same job too. It also ignores the double quote and captures the string value. Finally, it fetches the value – GET /example.com HTTP/1.1
6th Capturing Group (-|[0-9]*)
This group has also two alternatives, either it fetches “-” character or any digit (any number of digits since asterisk quantifier is specified). Thus it fetches the value “300”
7th Capturing Group (-|[0-9]*)
Similar to the previous group it captures the value “908”
?: groups multiple tokens together without creating a capture group.
8th Capturing Group ([^ \”]*|\”[^\”]*\”)
This groups multiple tokens together and creates a capturing group for extracting a substring.
This captures “-” from the given string.
9th Capturing Group ([^ \”]*|\”[^\”]*\”))?
Similar to the previous this captures “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
Now, let’s see another approach to convert the log into structured format.
Method-2
CREATE TABLE apache_combined_log2(
ipaddr STRING,
user STRING,
host STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+.(\\S+\\s+\\S+\\s+\\S+.\\S+).\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+).\\s+.(\\S+\\s+\\S+\\s+\\S+\\s+.\\S+).$”,
“output.format.string” = “%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s”
) STORED AS TEXTFILE;
Capital “S” is for String and small “s” is for space. ^ represents beginning of the line and “$” represents end of the line. The input regex here is a self-explanatory and we already discussed about this pattern in the previous posts.
Please feel free to touch base with me in case you need more clarification.
One comment