Word Count in Apache Pig

Pig is a high-level scripting platform that runs on Apache Hadoop. It lets you write data transformations without knowing Java. Its SQL-like scripting language is called Pig Latin and appeals to developers already familiar with SQL and scripting languages. Below is a sample word-count program that shows how to work with Pig.

Dataset:
forWordCount.txt
“This is not a text data but a test data to test word count program.”

Code:
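-- Load the file, reading each line into the chararray field 'line'.
input_lines = LOAD '/user/cloudera/forWordCount.txt' AS (line:chararray);
-- Split each line into tokens and flatten the bag to one word per row.
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Keep only tokens made up entirely of word characters.
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Group identical words together.
word_groups = GROUP filtered_words BY word;
-- Count the rows in each group.
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- Sort by count, highest first, and print the result.
ordered_word_count = ORDER word_count BY count DESC;
DUMP ordered_word_count;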

The output is shown below:

[Screenshot: WordCount_Pig_result]
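
As a rough cross-check of what the script computes, the same steps can be emulated in plain Python on the sample sentence. This is only a sketch, not Pig itself. It assumes the file holds the sentence without the surrounding quotation marks, and that TOKENIZE splits on space, double quote, comma, parentheses and asterisk. The Python variable names simply mirror the Pig relations.

```python
import re
from collections import Counter

# Sample sentence from forWordCount.txt (assumed to be stored without quotation marks).
line = "This is not a text data but a test data to test word count program."

# Rough stand-in for TOKENIZE: split on space, double quote, comma,
# parentheses and asterisk (assumed delimiter set).
words = [w for w in re.split(r'[ ",()*]+', line) if w]

# FILTER ... MATCHES '\\w+': keep tokens made up entirely of word characters.
# "program." is dropped because of its trailing period.
filtered_words = [w for w in words if re.fullmatch(r"\w+", w, flags=re.ASCII)]

# GROUP ... BY word, then COUNT: one (count, word) pair per distinct word.
word_count = [(n, w) for w, n in Counter(filtered_words).items()]

# ORDER ... BY count DESC. Ties are also sorted by word here, only to make
# this sketch repeatable; the Pig script orders by count alone.
ordered_word_count = sorted(word_count, key=lambda pair: (-pair[0], pair[1]))

for pair in ordered_word_count:
    print(pair)
```

Under these assumptions, "a", "data" and "test" each get a count of 2 and every other remaining word gets 1. The token "program." never reaches the count because the filter rejects anything containing punctuation.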
