Grouping in Apache Pig

In this post, we’ll look at how to group the data in a relation (table) with Apache Pig. Grouping and sorting data are among the most common operations in data analysis.

If you’re acquainted with SQL, you’ll recognize the query below.
SELECT deptno, SUM(sal) FROM emp GROUP BY deptno;

This query returns the total salary paid to the employees of each department.

Let’s do the same thing in Apache Pig.

Step 1: Loading the data into the “emp” relation.
emp = LOAD 'Desktop/Docs/emp.csv' USING PigStorage(',') AS (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:double, comm:double, deptno:int);

Step 2: Grouping the data

a = GROUP emp BY deptno;
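
To see what GROUP actually produces, it can help to inspect the grouped relation before aggregating. A minimal sketch, assuming the relation a from the statement above: each row of a has two fields, one named group (the deptno value) and a bag named emp holding every employee tuple for that department.

```pig
-- Show the schema of the grouped relation: a field named "group"
-- plus a bag named "emp" containing the original tuples.
DESCRIBE a;

-- Print the grouped rows to inspect the bags (sensible for small inputs only).
DUMP a;
```

This is why the next step can refer to emp.sal: it projects the sal column out of each department's bag.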

Step 3: Summing up the data.
result = FOREACH a GENERATE group, SUM(emp.sal);
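
The same FOREACH can also name its output fields and compute more than one aggregate per group. A sketch, assuming the relation a from the grouping step; the alias summary and the field names total_sal, avg_sal, and num_emps are illustrative choices, not part of the original script.

```pig
-- One output row per department, with named fields:
--   deptno    the grouping key
--   total_sal sum of sal over the department's bag
--   avg_sal   average of sal over the department's bag
--   num_emps  number of tuples in the bag (COUNT skips tuples
--             whose first field is null)
summary = FOREACH a GENERATE
    group        AS deptno,
    SUM(emp.sal) AS total_sal,
    AVG(emp.sal) AS avg_sal,
    COUNT(emp)   AS num_emps;
```

Naming the fields makes later statements easier to read, since they can refer to total_sal instead of a positional reference such as $1.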

Step 4: Printing the result to the screen
DUMP result;
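
DUMP is convenient for small results. For larger outputs, the result would more typically be written to storage with STORE. A sketch, assuming the result relation from above; the output path output/dept_salary is a hypothetical name chosen for illustration.

```pig
-- Write the result as comma-separated text.
-- 'output/dept_salary' is a hypothetical directory; it must not already exist.
STORE result INTO 'output/dept_salary' USING PigStorage(',');
```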

Hope you find this article helpful.
