Pig is a high-level scripting language used with Apache Hadoop. It lets you write data transformations without knowing Java. Pig's SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with SQL and scripting languages. Below is a sample word-count program that shows how to work with Pig.
Dataset:
forWordCount.txt
"This is not a text data but a test data to test word count program."
Code:
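-- Load each line of the file as a single chararray field
input_lines = LOAD '/user/cloudera/forWordCount.txt' AS (line:chararray);
-- Split each line into tokens and flatten the bag into one record per token
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Keep only tokens made up entirely of word characters
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Group identical words and count the records in each group
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- Sort by count, most frequent first, and print the result
ordered_word_count = ORDER word_count BY count DESC;
DUMP ordered_word_count;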
Running the script prints the resulting (count, word) tuples, ordered by count in descending order.
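To trace what each statement in the script does, the same pipeline can be mirrored in plain Python. This is only a rough sketch and not how Pig executes the script. It assumes that TOKENIZE splits on spaces, double quotes, commas, parentheses and asterisks, that MATCHES must match the whole token, and that the input line does not include the surrounding quotation marks. The function name word_count and the alphabetical tie-breaking are choices made for this illustration; Pig does not guarantee any order among words with equal counts.

```python
import re
from collections import Counter

# Approximation of the separators assumed for Pig's TOKENIZE
TOKEN_DELIMS = re.compile(r'[ ",()*]+')
# Counterpart of MATCHES '\\w+': the whole token must be word characters
WORD = re.compile(r'\w+', re.ASCII)


def word_count(lines):
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one record per token
    words = [tok for line in lines for tok in TOKEN_DELIMS.split(line) if tok]
    # FILTER words BY word MATCHES '\\w+'
    filtered = [w for w in words if WORD.fullmatch(w)]
    # GROUP filtered_words BY word, then COUNT per group
    counts = Counter(filtered)
    # ORDER word_count BY count DESC (ties broken alphabetically in this sketch)
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    sample = ['This is not a text data but a test data to test word count program. ']
    for count, word in word_count(sample):
        print((count, word))
```

Under these assumptions the token "program." is dropped by the filter, because its trailing period is not a word character. Stripping punctuation before tokenizing would be one way to keep such words.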