Although Apache Pig shares some similarities with SQL, it differs in functionality and behavior. Most SQL features, including as sorting, grouping, and joining, may be done in Apache Pig, albeit in a somewhat different manner. In this post, we’ll look at how to connect the two relations (tables) to accomplish self-join.
What is a self-join?
As the name indicates, the self join joins a table to itself. However, the table must have two columns in order to be joined to itself.
Implementation in Apache Pig:
Load the data into two relations instead of one and join the relations. This solution is similar to SQL; with SQL, we create two different aliases for the same table such that both aliases act as two separate tables. Let’s see how it is to be done.
Prerequisites:
1) Sample Data.
File Name: emp.csv
“7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
“7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
“7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
“7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
“7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
“7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
“7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
“7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
“7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
“7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
“7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
“7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
“7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
“7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10
2) Loading the data into a relation.
Execute the below in a single line to avoid each line execution.
emp1 = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
“““( empno:int,
“““`ename:chararray,
“““`job:chararray,
“““`mgr:int,
“““`hiredate:chararray,
“““`sal:double,
“““`comm:double,
“““`deptno:int);
emp2 = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
“““`( empno:int,
“““`ename:chararray,
“““`job:chararray,
“““`mgr:int,
“““`hiredate:chararray,
“““`sal:double,
“““`comm:double,
“““`deptno:int);
Joining Example:
joined= JOIN emp1 BY mgr, emp2 BY empno;
Dump joined;
Result:
“`All the matched records from both the relations/tables.
The above command is equal to:
SELECT * FROM emp a JOIN emp b ON a.mgr = b.empno;
Hope you find this article helpful.
Please do follow us for more interesting updates.
One comment