Joining in Apache Pig is identical to joining in SQL, in that it is used to combine tuples (rows) from two or more relations (tables) based on a common column. This post will let you know how to join multiple relations in Apache Pig.
Prerequisites:
1) Sample Data.
File Name: emp.csv
“7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
“7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
“7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
“7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
“7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
“7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
“7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
“7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
“7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
“7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
“7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
“7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
“7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
“7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10
File Name: dept.csv
“10, ACCOUNTING, NEW YORK,
“20, RESEARCH, DALLAS,
“30, SALES, CHICAGO,
“40, OPERATIONS, BOSTON;
2) Loading the data into a relation.
“Execute the below in a single line to avoid each line execution.
emp = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
“““( empno:int,
“““ename:chararray,
“““job:chararray,
“““mgr:int,
“““hiredate:chararray,
“““sal:double,
“““comm:double,
“““deptno:int);
dept = LOAD ‘Desktop/Docs/dept.csv’ USING PigStorage(‘,’) AS
“““( deptno:int,
“““dname:chararray,
“““loc:chararray);
Joining Example:
joined= JOIN emp BY deptno, dept BY deptno;
Dump Joined;
Result:
“““All the rows from “emp” and “dept”.
“`The above command is equal to:
SELECT * FROM emp JOIN dept ON emp.deptno = dept.deptno;
Hope you find this article helpful.
Please follow us for more interesting updates.
One comment