If you’ve used Hive’s SPLIT function, Apache Impala’s SPLIT_PART function, or SQL Server’s STRING_SPLIT function, Apache Pig’s STRSPLIT function will feel familiar: it splits an input string around matches of a regular expression and returns the pieces as a tuple. An optional third argument limits how many pieces are produced.
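As a quick illustration of the optional limit argument, here is a minimal sketch. The file name sample.txt and the relation and field names are hypothetical, and the file is assumed to contain the single line a_b_c.

```pig
-- Hypothetical input file 'sample.txt' containing the single line: a_b_c
lines = LOAD 'sample.txt' USING PigStorage() AS (line:chararray);

-- No limit given: split at every underscore.
all_parts = FOREACH lines GENERATE STRSPLIT(line, '_');

-- Limit of 2: split at the first underscore only and keep the rest intact.
two_parts = FOREACH lines GENERATE STRSPLIT(line, '_', 2);

DUMP all_parts;
DUMP two_parts;
```

Under these assumptions, the first DUMP should show a three-field tuple (a,b,c) and the second a two-field tuple (a,b_c). Because the delimiter is treated as a regular expression, characters with a special meaning in regex syntax generally need escaping.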
Let’s do some practice exercises for better understanding.
Sample data:
The data below contains each movie’s release year and title, joined by an underscore. Save this data in a CSV file.
1969_DownhillRacer
1970_M*A*S*H
1970_ThePartyAtKittyAndStud’s
1970_LoversAndOtherStrangers
1970_TheSidelongGlancesOfAPigeonKicker
1970_HerculesInNewYork
1971_Bananas
1971_Klute
1972_What’sUp,Doc?
1973_NoPlaceToHide
Loading data into a relation
rawdata = LOAD 'Desktop/Docs/movies.csv' USING PigStorage() AS (data:chararray);
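Splitting the data using the STRSPLIT function. The second argument is the delimiter (a regular expression) and the third argument caps the result at two parts, so each line is split at its first underscore only.
moviedata = FOREACH rawdata GENERATE STRSPLIT(data, '_', 2);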
Retrieve data.
DUMP moviedata;
Result:
((1969,DownhillRacer))
((1970,M*A*S*H))
((1970,ThePartyAtKittyAndStud’s))
((1970,LoversAndOtherStrangers))
((1970,TheSidelongGlancesOfAPigeonKicker))
((1970,HerculesInNewYork))
((1971,Bananas))
((1971,Klute))
((1972,What’sUp,Doc?))
((1973,NoPlaceToHide))
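Each result above is a tuple nested inside the output record, which is why the double parentheses appear. A common follow-up is to unnest it into named fields with FLATTEN. The sketch below assumes the same movies.csv file; the relation names movies, typed, and movies_1970 and the field names year and title are illustrative choices, not part of the original example.

```pig
-- Same load as in the walkthrough; a real Pig script needs straight quotes.
rawdata = LOAD 'Desktop/Docs/movies.csv' USING PigStorage() AS (data:chararray);

-- FLATTEN unnests the tuple returned by STRSPLIT into top-level fields,
-- and AS gives those fields names and types.
movies = FOREACH rawdata GENERATE
    FLATTEN(STRSPLIT(data, '_', 2)) AS (year:chararray, title:chararray);

-- Cast the year to int so it can be compared numerically.
typed = FOREACH movies GENERATE (int)year AS year, title;

-- Keep only the movies released in 1970.
movies_1970 = FILTER typed BY year == 1970;

DUMP movies_1970;
```

With the sample data, this should print the five 1970 titles as flat (year,title) records rather than nested tuples.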
Hope you find this article helpful.