COGROUP in Apache Pig

COGROUP works much like the GROUP operator. The main difference is that GROUP is normally applied to a single relation, whereas COGROUP groups two or more relations by a common key in one statement.

Many of Apache Pig’s operators resemble their SQL counterparts, but they differ in the details. COGROUP has no direct equivalent in SQL.

Let’s see how to implement it.

Prerequisites:
1) Sample Data.
    File Name: emp.csv
7839,KING,PRESIDENT,0,17/Nov/1981,5000,0,10
7698,BLAKE,MANAGER,7839,01/May/1981,2850,0,30
7782,CLARK,MANAGER,7839,06/Sep/1981,2450,0,10
7566,JONES,MANAGER,7839,04/Feb/1981,2975,0,20
7788,SCOTT,ANALYST,7566,13/Jul/87,3000,0,20
7902,FORD,ANALYST,7566,03/Dec/1981,3000,0,20
7369,SMITH,CLERK,7902,17/Dec/1980,800,0,20
7499,ALLEN,SALESMAN,7698,20/Feb/1981,1600,300,30
7521,WARD,SALESMAN,7698,22/Feb/1981,1250,500,30
7654,MARTIN,SALESMAN,7698,28/Sep/1981,1250,1400,30
7844,TURNER,SALESMAN,7698,09/Aug/1981,1500,0,30
7876,ADAMS,CLERK,7788,13/Jul/87,1100,0,20
7900,JAMES,CLERK,7698,03/Dec/1981,950,0,30
7934,MILLER,CLERK,7782,23/Jan/1982,1300,0,10

    File Name: dept.csv
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

2) Loading the data into a relation.
Enter each LOAD statement below as a single statement, ending with the semicolon (it can be typed on one line in the Grunt shell).
emp = LOAD 'Desktop/Docs/emp.csv' USING PigStorage(',') AS
( empno:int,
ename:chararray,
job:chararray,
mgr:int,
hiredate:chararray,
sal:double,
comm:double,
deptno:int);

dept = LOAD 'Desktop/Docs/dept.csv' USING PigStorage(',') AS
( deptno:int,
dname:chararray,
loc:chararray);
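
Before grouping, it can help to confirm that both relations loaded with the intended schema. The lines below are a minimal sketch of such a check, assuming the emp and dept relations defined above; the alias emp_sample is an illustrative name and is not part of the original example.

```pig
-- Print the schema that Pig has attached to each relation.
DESCRIBE emp;
DESCRIBE dept;

-- Peek at a few rows of emp before grouping.
-- emp_sample is a hypothetical alias; LIMIT does not guarantee which rows are returned.
emp_sample = LIMIT emp 3;
DUMP emp_sample;
```

If a numeric column shows up empty in the sample, the usual suspects are a wrong delimiter or stray spaces in the CSV file.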

Example:
CoGroupedData = COGROUP emp BY deptno, dept BY deptno;
Dump CoGroupedData;


Result:

(10,{(7839,KING,PRESIDENT,0,17/Nov/1981,5000.0,0.0,10),(7934,MILLER,CLERK,7782,23/Jan/1982,1300.0,0.0,10),(7782,CLARK,MANAGER,7839,06/Sep/1981,2450.0,0.0,10)},{(10,ACCOUNTING,NEW YORK)})

(20,{(7902,FORD,ANALYST,7566,03/Dec/1981,3000.0,0.0,20),(7788,SCOTT,ANALYST,7566,13/Jul/87,3000.0,0.0,20),(7566,JONES,MANAGER,7839,04/Feb/1981,2975.0,0.0,20),(7369,SMITH,CLERK,7902,17/Dec/1980,800.0,0.0,20),(7876,ADAMS,CLERK,7788,13/Jul/87,1100.0,0.0,20)},{(20,RESEARCH,DALLAS)})

(30,{(7844,TURNER,SALESMAN,7698,09/Aug/1981,1500.0,0.0,30),(7654,MARTIN,SALESMAN,7698,28/Sep/1981,1250.0,1400.0,30),(7521,WARD,SALESMAN,7698,22/Feb/1981,1250.0,500.0,30),(7499,ALLEN,SALESMAN,7698,20/Feb/1981,1600.0,300.0,30),(7698,BLAKE,MANAGER,7839,01/May/1981,2850.0,0.0,30),(7900,JAMES,CLERK,7698,03/Dec/1981,950.0,0.0,30)},{(30,SALES,CHICAGO)})

(40,{},{(40,OPERATIONS,BOSTON)})

The output above is organized by department number (deptno), which exists in both relations. Because two relations are grouped, each output tuple holds two bags per department: the first bag contains all matching tuples from the first relation, “emp”, and the second bag contains all matching tuples from the second relation, “dept”. For department number 40 there are no matching entries in “emp”, so its first bag is empty.
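
Once the data is cogrouped, each bag can be referenced by the name of its source relation. The following is a rough sketch of a few common follow-up steps, assuming the CoGroupedData relation from the example; the aliases DeptCounts, EmptyDepts and Flattened and the field name emp_count are illustrative names, not part of the original example.

```pig
-- Count the employees in each department by counting the tuples in the emp bag.
-- DeptCounts and emp_count are hypothetical names.
DeptCounts = FOREACH CoGroupedData GENERATE group AS deptno, COUNT(emp) AS emp_count;
DUMP DeptCounts;

-- Keep only the groups whose emp bag is empty (department 40 in the result above).
EmptyDepts = FILTER CoGroupedData BY IsEmpty(emp);
DUMP EmptyDepts;

-- Flattening both bags pairs every emp tuple with its dept tuple.
-- A group with an empty bag produces no rows, so this behaves like an inner join.
Flattened = FOREACH CoGroupedData GENERATE FLATTEN(emp), FLATTEN(dept);
DUMP Flattened;
```

The last statement also shows how COGROUP relates to a join: COGROUP keeps the per-key bags intact, including empty ones, while flattening them discards the keys that have no match on one side.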

Hope you find this post helpful.

Please subscribe for more interesting updates.
