Outer Joins in Apache Pig

Joining in Apache Pig is identical to joining in SQL, in that it is used to combine tuples (rows) from two or more relations (tables) based on a common column. This post will let you know how to join multiple relations in Apache Pig.

Prerequisites:
1) Sample Data.
    File Name: emp.csv
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10

    File Name: dept.csv
10, ACCOUNTING, NEW YORK,
20, RESEARCH, DALLAS,
30, SALES, CHICAGO,
40, OPERATIONS, BOSTON;

2) Loading the data into a relation.
Execute the below in a single line to avoid each line execution.
emp = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
( empno:int,
ename:chararray,
job:chararray,
mgr:int,
hiredate:chararray,
sal:double,
comm:double,
deptno:int);

dept = LOAD ‘Desktop/Docs/dept.csv’ USING PigStorage(‘,’) AS
( deptno:int,
dname:chararray,
loc:chararray);

Left Join:
This join retrieves all rows from the table on the left side of the join, as well as matching rows from the table on the right. The result-set will include null for the rows for which there is no matching row on the right side.

Example:
left-joined= JOIN emp BY deptno left outer, dept BY deptno;
Dump left-Joined;

       Result:
All the rows from “emp” and “dept”.

`The above command is equal to: 

       SELECT * FROM emp LEFT JOIN dept ON emp.deptno = dept.deptno;

Right Join:
This join retrieves all rows from the table on the right side of the join, as well as matching rows from the table on the left. The result-set will include null for the rows for which there is no matching row on the left side.

Example:
right-joined= JOIN emp BY deptno right outer, dept BY deptno;
Dump right-Joined;

       Result:
All the rows from “emp” and “dept”.

`The above command is equal to: 

SELECT * FROM emp RIGHT JOIN dept ON emp.deptno = dept.deptno;

Full Outer Join:
The FULL OUTER JOIN returns all records from both relations that are matched and unmatched. It combines the results of the Left Outer Join and the Right Outer Join, and NULL is assigned in the rows for the unmatched records.

Example:
full-outer-joined= JOIN emp BY deptno full outer, dept BY deptno;
Dump full-outer-Joined;

Result:
All the rows from “emp” and “dept”.

The above command is equal to: 
SELECT * FROM emp FULL OUTER JOIN dept ON emp.deptno = dept.deptno;

Hope you find this article helpful.

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s