REGEXP_EXTRACT in Apache Pig

Regular expression functions identify precise patterns of characters in the given string and returns the matching text item from the string/data.

Syntax: REGEXP_EXTRACT(STRING data, STRING regex pattern, INT index)

Let’s see practically how it works.

Sample data:

Save the below data in a CSV file.
192.168.1.1:1433
192.168.1.2:9020
192.168.1.3:9012
192.168.1.4:8020
192.168.1.5:1123
192.168.1.6:1124

Loading the data into relation.
ipdata = LOAD ‘Desktop/Docs/ipaddr.csv’ USING PigStorage() as (data:chararray);

Splitting the data based on regular expression.
ipAddrs = FOREACH ipdata GENERATE REGEX_EXTRACT(data, ‘(.*):(.*)’, 1);
ipPorts = FOREACH ipdata GENERATE REGEX_EXTRACT(data, ‘(.*):(.*)’, 2);

With the aforementioned approach, the relation “ipAddrs” will only include IP addresses, whereas “ipPorts” will have information about ports. Let’s get the information.

Data retrieval:
DUMP ipAddrs;

Result:
(192.168.1.1)
(192.168.1.2)
(192.168.1.3)
(192.168.1.4)
(192.168.1.5)
(192.168.1.6)

Data retrieval:
DUMP ipPorts;

Result:
(1433)
(9020)
(9012)
(8020)
(1123)
(1124)

And, if the string has to be divided and the data shown in two columns, you can combine the two REGEXP EXTRACT commands into a single expression, as seen below.

Splitting the data based on regular expression.
ipAddrs = FOREACH ipdata
GENERATE REGEX_EXTRACT(data, ‘(.*):(.*)’, 1),REGEX_EXTRACT(data, ‘(.*):(.*)’, 2);

Result:
(192.168.1.1, 1433)
(192.168.1.2, 9020)
(192.168.1.3, 9012)
(192.168.1.4, 8020)
(192.168.1.5, 1123)
(192.168.1.6, 1124)

Hope you find this article helpful.

Please subscribe for more interesting updates.