COGROUP in Apache Pig

The COGROUP operator works in much the same way as the GROUP operator. The main difference is that GROUP is normally used with a single relation, whereas COGROUP is used to group two or more relations by a common key in one statement.

Many of Apache Pig's operators look similar to their SQL counterparts, but they differ in places. COGROUP, for instance, has no direct equivalent in SQL.

Let’s see how to implement it.

Prerequisites:
1) Sample Data.
File Name: emp.csv
7839,KING,PRESIDENT,0,17/Nov/1981,5000,0,10
7698,BLAKE,MANAGER,7839,01/May/1981,2850,0,30
7782,CLARK,MANAGER,7839,06/Sep/1981,2450,0,10
7566,JONES,MANAGER,7839,04/Feb/1981,2975,0,20
7788,SCOTT,ANALYST,7566,13/Jul/87,3000,0,20
7902,FORD,ANALYST,7566,03/Dec/1981,3000,0,20
7369,SMITH,CLERK,7902,17/Dec/1980,800,0,20
7499,ALLEN,SALESMAN,7698,20/Feb/1981,1600,300,30
7521,WARD,SALESMAN,7698,22/Feb/1981,1250,500,30
7654,MARTIN,SALESMAN,7698,28/Sep/1981,1250,1400,30
7844,TURNER,SALESMAN,7698,09/Aug/1981,1500,0,30
7876,ADAMS,CLERK,7788,13/Jul/87,1100,0,20
7900,JAMES,CLERK,7698,03/Dec/1981,950,0,30
7934,MILLER,CLERK,7782,23/Jan/1982,1300,0,10

File Name: dept.csv
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

2) Loading the data into a relation.
If you are typing in the Grunt shell, you can enter each statement below on a single line.
emp = LOAD 'Desktop/Docs/emp.csv' USING PigStorage(',') AS
    (empno:int,
     ename:chararray,
     job:chararray,
     mgr:int,
     hiredate:chararray,
     sal:double,
     comm:double,
     deptno:int);

dept = LOAD 'Desktop/Docs/dept.csv' USING PigStorage(',') AS
    (deptno:int,
     dname:chararray,
     loc:chararray);
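
Before grouping, it can help to confirm that both relations loaded with the schema you expect. The following is a minimal sketch, assuming the emp and dept relations defined above; emp_sample is a hypothetical alias introduced only for this check.

```pig
-- Print the schema Pig has recorded for each relation.
DESCRIBE emp;
DESCRIBE dept;

-- Look at a few rows without dumping the whole relation.
-- emp_sample is a hypothetical alias used only for this check.
emp_sample = LIMIT emp 3;
DUMP emp_sample;
```

If a typed field such as sal shows up empty in the dump, the value in the file most likely could not be cast to the declared type, which is worth fixing before moving on.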

Example:
CoGroupedData = COGROUP emp BY deptno, dept BY deptno;
Dump CoGroupedData;

Result:
(10,{(7839,KING,PRESIDENT,0,17/Nov/1981,5000.0,0.0,10),(7934,MILLER,CLERK,7782,23/Jan/1982,1300.0,0.0,10),(7782,CLARK,MANAGER,7839,06/Sep/1981,2450.0,0.0,10)},{(10,ACCOUNTING,NEW YORK)})

(20,{(7902,FORD,ANALYST,7566,03/Dec/1981,3000.0,0.0,20),(7788,SCOTT,ANALYST,7566,13/Jul/87,3000.0,0.0,20),(7566,JONES,MANAGER,7839,04/Feb/1981,2975.0,0.0,20),(7369,SMITH,CLERK,7902,17/Dec/1980,800.0,0.0,20),(7876,ADAMS,CLERK,7788,13/Jul/87,1100.0,0.0,20)},{(20,RESEARCH,DALLAS)})

(30,{(7844,TURNER,SALESMAN,7698,09/Aug/1981,1500.0,0.0,30),(7654,MARTIN,SALESMAN,7698,28/Sep/1981,1250.0,1400.0,30),(7521,WARD,SALESMAN,7698,22/Feb/1981,1250.0,500.0,30),(7499,ALLEN,SALESMAN,7698,20/Feb/1981,1600.0,300.0,30),(7698,BLAKE,MANAGER,7839,01/May/1981,2850.0,0.0,30),(7900,JAMES,CLERK,7698,03/Dec/1981,950.0,0.0,30)},{(30,SALES,CHICAGO)})

(40,{},{(40,OPERATIONS,BOSTON)})

The output above is organized by department number, the key that exists in both relations. Because two relations are grouped, each output tuple holds two bags per department: the first bag contains all matching tuples from the first relation, emp, and the second bag contains all matching tuples from the second relation, dept. Department 40 has no matching entries in emp, so its first bag is empty.
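
The two bags can be processed further with FOREACH and FILTER. The following is a minimal sketch, assuming emp, dept and CoGroupedData exist exactly as defined above; the aliases DeptSummary, EmptyDepts and Joined are made up for illustration and are not part of the original example.

```pig
-- Assumes emp, dept and CoGroupedData are defined as in the steps above.
-- DeptSummary, EmptyDepts and Joined are hypothetical aliases.

-- 1) One row per department: the key, the department name,
--    and the number of tuples in the emp bag.
--    FLATTEN drops a row whose dept bag is empty; in this sample
--    data every department number also appears in dept.
DeptSummary = FOREACH CoGroupedData GENERATE
    group AS deptno,
    FLATTEN(dept.dname) AS dname,
    COUNT(emp) AS emp_count;

-- 2) Keep only the groups whose emp bag is empty,
--    i.e. departments without any employees.
EmptyDepts = FILTER CoGroupedData BY IsEmpty(emp);

-- 3) Flattening both bags pairs every emp tuple with the dept tuple
--    that shares its key, which is how an inner join can be expressed
--    with COGROUP. Groups with an empty bag produce no rows.
Joined = FOREACH CoGroupedData GENERATE FLATTEN(emp), FLATTEN(dept);

DUMP DeptSummary;
DUMP EmptyDepts;
DUMP Joined;
```

Keeping the bags separate, rather than joining straight away, is what lets you inspect each side on its own, for example to spot keys that are present in one relation but missing from the other.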

Hope you find this post helpful.
