Filtering Data in Apache Pig

The goal of this post is to show you how to retrieve filtered data from a relation (table) in Apache Pig. Apache Pig has a relational operator called FILTER that can assist you achieve your aim.

The FILTER operator selects tuples from a relation based on some condition. It is equivalent to the SQL “WHERE” clause, “WHERE..AND” and “WHERE..OR..” clauses.

Syntax:
alias = FILTER alias  BY expression;
Here, alias is the name of the relation.

Prerequisites:
1) Sample Data.
    File Name: emp.csv
7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10

2) Loading the data into a relation.
Execute the below in a single line to avoid each line execution.
data = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
““` (empno:int,
“““ename:chararray,
“““job:chararray,
“““mgr:int,
“““hiredate:chararray,
“““sal:double,
“““comm:double,
“““deptno:int);

Example-1:
result = FILTER data BY (deptno == 10);
Dump result;

       Result:
“““(7839,KING,PRESIDENT,0,17/Nov/1981,5000.0,0.0,10)
“““(7782,CLARK,MANAGER,7839,06/Sep/1981,2450.0,0.0,10)
“““(7934,MILLER,CLERK,7782,23/Jan/1982,1300.0,0.0,10)

The above command is equal to: 

       SELECT * FROM emp WHERE deptno = 10;

Example-2:
Multiple Conditions/Clauses:
result = FILTER data BY (deptno == 10) AND (sal > 2000);
Dump result;

“`Result:

“““(7839, KING, PRESIDENT, 0, 17/Nov/1981, 5000.0, 0.0, 10)
“““(7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450.0, 0.0, 10)

“`The above command is equal to: 
“`SELECT * FROM emp WHERE deptno = 10 AND salary > 2000;

Example-3:
Multiple Conditions/Clauses:
result = FILTER data BY (deptno == 10) OR (deptno == 40);
Dump result;


       Result:
“““(7839, KING, PRESIDENT, 0, 17/Nov/1981, 5000.0, 0.0, 10)
“““(7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450.0, 0.0, 10)
“““(7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300.0, 0.0, 10)

“`The above command is equal to: 

“`SELECT * FROM emp WHERE (deptno = 10 OR deptno=40);

Hope you find this article helpful.

Please do follow for more interesting updates.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s