This is a continuity part of the series “Hive SerDe Regular Expressions”. Most probably, the series will end before it hits 10.
RegexSerDe uses regular expression (regex) for serialization/deserialization. You can deserialize data using regex and extract groups as columns. You can also serialize a row object using a string format.
In the deserialization point, if the row does not match the regex, all columns in the row will be Empty. If the row matches the regex but has fewer groups than anticipated, the missed groups would be NULL. If the row fits the regex but has more than planned groups, the extra groups would be skipped.
In the serialization point, the Java string formatter is used to format the columns into a row. If the output type of the column in the query is not a string, it will be converted to the Hive string automatically.
NOTE: Apparently, all columns must be strings. Users may use “CAST(col1 AS INT)” to convert columns to other types.
Let’s reproduce the error for the above statement “In the deserialization point, if the row does not match the regex, all columns in the row will be Empty.”
Sample Data: Saved as a CSV file – NewOrders.CSV
1234 Tech Company $In Process
2345 Softech Co $Delivery Completed
3456 – $Delivery Confirmed
4567 SysCoAgency $Delivery Confirmed
If you look at the details (though it is obvious), there are three columns, Order ID, Company Name, Shipping Status. The first and second columns are space-separated. The second and third columns are divided by a space and a dollar sign ($). Company Name and Shipping Status columns have two space-delimited words/strings (e.g. Tech Company).
As mentioned above, there is far less possibility of getting this sort of obvious info. However, our goal is to work out how to use regex patterns to translate different types of data.
Let’s begin the exercise.
CREATE TABLE neworders1(orderid string, company string, order_status string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\S+)\\s+(\\S+\\s+\\S+)\\s.+(\\S+\\s+\\S+)$”,
“output.format.string”=”%1$s %2$s %3$s”)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH ‘Desktop/neworders.csv’ INTO TABLE neworders1;
Now, we have created and loaded the data into the table. Let’s see what we’ve got –
1234 Some Company Process
2345 Another Company Confirmed
NULL NULL NULL
4567 Some Agency Confirmed
NULL NULL NULL
Though we have 4 lines data in our CSV, it is showing one additional line with NULL values. It is because of preformatted CSV file. If there is a chance convert the data into plain text file. Usually there will be log files or text files that needed to be converted using regular expressions. So if there is a possibility switch over to TXT file rather than CSV. Anyways, we can ignore the row for now. If you look at the 3rd line, all the rows of the line is NULL. It is because company name is provided as ‘-‘ (dash symbol) in the input file and the regex (\\S+\\s+\\S+) couldn’t identify the value. So, if the row does not match the regex, all columns in the row will be Empty.
If you look at the data again, the third column is two word string value, however when imported it is showing only one part of it. The input regex identified multiple strings than we expected. If the row fits the regex but has more than planned groups, the extra groups would be skipped.
Let’s reproduce the error for the next statement “If the row matches the regex but has fewer groups than anticipated, the missed groups would be NULL.”
Now consider the same data file “neworders.csv”.
CREATE TABLE neworders(orderid int,company string,order_status string)
row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\S+)([^|]+)\\$(.*$)”,
“output.format.string”=”%1$s %2$s %3$s”);
LOAD DATA LOCAL INPATH ‘Desktop/neworders.csv’ INTO TABLE neworders;
Now let’s see what we’ve got in the table.
NULL Tech Company In Process
NULL Softech Co Delivery Completed
NULL – Delivery Confirmed
NULL SysCoAgency Delivery Confirmed
Though the other two columns are perfect and the data is as expected. However, the first column got missed due to the regex pattern “^(\\S+)([^|]+)\\$(.*$)” we provided. There is a pipe symbol in the regex that caused the issue. In regex patterns pipe symbol is like a either this or that kind of conditional statement. It considered the next two strings (after the pipe symbol) and ignored the first one.
Ok now let’s see how many ways regular expression can be written to convert the above said data.
Sometimes I used “neworders.txt” and “neworders.csv”
CREATE TABLE neworders2(orderid string,company string,order_status string)
row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=“^(\\S+)\\s+([^$]+)\\$(.+)$”,
“output.format.string”=”%1$s %2$s %3$s”);
CREATE TABLE neworders2(orderid string,company string,order_status string)
row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\S+)\\s+([^$]+)\\$(.*$)“,
“output.format.string”=”%1$s %2$s %3$s”);
CREATE TABLE neworders(orderid string,company string,order_status string)
row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\d+)\\s+([^$]+)\\$(.*$)”,
“output.format.string”=”%1$s %2$s %3$s”);
CREATE TABLE neworders(orderid int,company string,order_status string)
row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\d+)\\s+([^$]+)\\$(.*$)”,
“output.format.string”=”%1$s %2$s %3$s”);
CREATE TABLE neworders(orderid int,company string,order_status string)
row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\S+)\\s+([^$]+)\\$(.*$)”,
“output.format.string”=”%1$s %2$s %3$s”);
CREATE TABLE neworders(orderid int,company string,order_status string)
row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
WITH SERDEPROPERTIES(
“input.regex”=”^(\\S+)([^$]+)\\$(.*$)”,
“output.format.string”=”%1$s %2$s %3$s”);
Hope you find this post helpful.
One comment