This post aims to let you know how to merge the relations in Apache Pig.
The UNION operator aids in the merger of two or more relations. The sole requirement for merging is that both the columns and domains of the relation be identical. The duplicate records will not be eliminated by this implementation.
Let’s see how to implement it.
Prerequisites:
1) Sample Data.
File Name: emp1.csv
“7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
“7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
“7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
“7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
“7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
“7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
“7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
File Name: emp2.csv
“7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
“7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
“7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
“7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
“7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
“7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
“7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10
2) Loading the data into a relation.
Execute the below in a single line to avoid each line execution.
emp1 = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
“““( empno:int,
“““`ename:chararray,
“““`job:chararray,
“““`mgr:int,
“““`hiredate:chararray,
“““`sal:double,
“““`comm:double,
“““`deptno:int);
emp2 = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
“““`( empno:int,
“““`ename:chararray,
“““`job:chararray,
“““`mgr:int,
“““`hiredate:chararray,
“““`sal:double,
“““`comm:double,
“““`deptno:int);
Joining Example:
mergedata = UNION emp1, emp2;
Dump mergedata;
Result:
“““`7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
“““`7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
“““`7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
“““`7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
“““`7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
“““`7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
“““`7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
“““`7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
“““`7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
“““`7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
“““`7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
“““`7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
“““`7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
“““`7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10
The above command is equal to:
SELECT * FROM emp1 UNION SELECT * FROM emp2;
One comment