Merging Multiple Relations Data – Apache Pig

This post aims to let you know how to merge the relations in Apache Pig.

The UNION operator aids in the merger of two or more relations. The sole requirement for merging is that both the columns and domains of the relation be identical. The duplicate records will not be eliminated by this implementation.

Let’s see how to implement it.

Prerequisites:
1) Sample Data.
    File Name: emp1.csv
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
    File Name: emp2.csv
7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10

2) Loading the data into a relation.
Execute the below in a single line to avoid each line execution.

emp1 = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
( empno:int,
“`ename:chararray,
“`job:chararray,
“`mgr:int,
“`hiredate:chararray,
“`sal:double,
“`comm:double,
“`deptno:int);

emp2 = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
“`( empno:int,
“`ename:chararray,
“`job:chararray,
“`mgr:int,
“`hiredate:chararray,
“`sal:double,
“`comm:double,
“`deptno:int);

Joining Example:
mergedata = UNION emp1, emp2;
Dump mergedata;

       Result:
“`7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
“`7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
“`7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
“`7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
“`7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
“`7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
“`7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
“`7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
“`7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
“`7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
“`7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
“`7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
“`7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
“`7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10

The above command is equal to: 

       SELECT * FROM emp1 UNION SELECT * FROM emp2;

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s