Regular expression functions identify precise patterns of characters in the given string and returns the matching text item from the string/data.
Syntax: REGEXP_EXTRACT(STRING data, STRING regex pattern, INT index)
Let’s see practically how it works.
Sample data:
Save the below data in a CSV file.
192.168.1.1:1433
192.168.1.2:9020
192.168.1.3:9012
192.168.1.4:8020
192.168.1.5:1123
192.168.1.6:1124
Loading the data into relation.
ipdata = LOAD ‘Desktop/Docs/ipaddr.csv’ USING PigStorage() as (data:chararray);
Splitting the data based on regular expression.
ipAddrs = FOREACH ipdata GENERATE REGEX_EXTRACT(data, ‘(.*):(.*)’, 1);
ipPorts = FOREACH ipdata GENERATE REGEX_EXTRACT(data, ‘(.*):(.*)’, 2);
With the aforementioned approach, the relation “ipAddrs” will only include IP addresses, whereas “ipPorts” will have information about ports. Let’s get the information.
Data retrieval:
DUMP ipAddrs;
Result:
(192.168.1.1)
(192.168.1.2)
(192.168.1.3)
(192.168.1.4)
(192.168.1.5)
(192.168.1.6)
Data retrieval:
DUMP ipPorts;
Result:
(1433)
(9020)
(9012)
(8020)
(1123)
(1124)
And, if the string has to be divided and the data shown in two columns, you can combine the two REGEXP EXTRACT commands into a single expression, as seen below.
Splitting the data based on regular expression.
ipAddrs = FOREACH ipdata
GENERATE REGEX_EXTRACT(data, ‘(.*):(.*)’, 1),REGEX_EXTRACT(data, ‘(.*):(.*)’, 2);
Result:
(192.168.1.1, 1433)
(192.168.1.2, 9020)
(192.168.1.3, 9012)
(192.168.1.4, 8020)
(192.168.1.5, 1123)
(192.168.1.6, 1124)
Hope you find this article helpful.
Please subscribe for more interesting updates.