To avoid erroneous data values, it’s common to look for and eliminate duplicate entries during data analysis. This article will teach you how to remove duplicate tuples from a relation in Apache Pig.
Prerequisites:
1) Sample Data.
File Name: emp.csv
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
2) Loading the data into a relation.
Enter the statement below on a single line so that it runs as one statement rather than line by line.
emp = LOAD 'Desktop/Docs/emp.csv' USING PigStorage(',') AS
    (empno:int,
    ename:chararray,
    job:chararray,
    mgr:int,
    hiredate:chararray,
    sal:double,
    comm:double,
    deptno:int);
Example:
Distinct_emp = DISTINCT emp;
Dump Distinct_emp;
The DISTINCT operator removes redundant (duplicate) tuples from the relation. It does not modify emp; the de-duplicated tuples are stored in a new relation named Distinct_emp.
Result:
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
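One way to confirm how many duplicates were dropped is to count the tuples before and after DISTINCT. The following is a minimal sketch, assuming the same file path and schema as above; the aliases emp_all, emp_count, distinct_all and distinct_count are illustrative names, not part of the original example.

```pig
-- Load the sample file (path and schema as used earlier in the article).
emp = LOAD 'Desktop/Docs/emp.csv' USING PigStorage(',') AS
    (empno:int, ename:chararray, job:chararray, mgr:int,
     hiredate:chararray, sal:double, comm:double, deptno:int);

-- Remove duplicate tuples.
Distinct_emp = DISTINCT emp;

-- GROUP ... ALL places every tuple in a single group,
-- so COUNT_STAR can count the whole relation.
emp_all = GROUP emp ALL;
emp_count = FOREACH emp_all GENERATE COUNT_STAR(emp) AS total;

distinct_all = GROUP Distinct_emp ALL;
distinct_count = FOREACH distinct_all GENERATE COUNT_STAR(Distinct_emp) AS total;

-- Print both counts; the difference is the number of duplicates removed.
DUMP emp_count;
DUMP distinct_count;
```

For the ten-line sample file shown earlier, the first count should be 10 and the second 5, since every record appears exactly twice.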
The DISTINCT statement is equivalent to the following SQL query:
SELECT DISTINCT * FROM emp;
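The SQL comparison also helps when only some columns matter. Pig's DISTINCT always compares whole tuples, so a query along the lines of SELECT DISTINCT job, deptno FROM emp needs a projection first. The sketch below assumes the same file path and schema as above; job_dept and distinct_job_dept are illustrative alias names.

```pig
-- Load the sample file (path and schema as used earlier in the article).
emp = LOAD 'Desktop/Docs/emp.csv' USING PigStorage(',') AS
    (empno:int, ename:chararray, job:chararray, mgr:int,
     hiredate:chararray, sal:double, comm:double, deptno:int);

-- Keep only the columns whose combinations should be unique.
job_dept = FOREACH emp GENERATE job, deptno;

-- DISTINCT works on entire tuples, so it is applied to the projected relation.
distinct_job_dept = DISTINCT job_dept;

DUMP distinct_job_dept;
```

Projecting before DISTINCT also reduces the amount of data that has to be compared during de-duplication.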