Removing Redundant Tuples in Apache Pig

To avoid erroneous data values, it’s common to look for and eliminate duplicate entries during data analysis. This article will teach you how to remove duplicate tuples from a relation in Apache Pig.

Prerequisites:
1) Sample Data.
    File Name: emp.csv
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20

2) Loading the data into a relation.
Execute the below in a single line to avoid each line execution.
emp = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
( empno:int,
ename:chararray,
job:chararray,
mgr:int,
hiredate:chararray,
sal:double,
comm:double,
deptno:int);

Example:
Distinct_emp = DISTINCT emp;
Dump Distinct_emp;
The above command helps in the removal of redundant tuples from the relation. This filtering will create a new relation name “distinct_data”

       Result:
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20

The above command is equal to: 
       SELECT DISTINCT * FROM emp;

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s