The goal of this post is to show you how to retrieve filtered data from a relation (table) in Apache Pig. Apache Pig has a relational operator called FILTER that can assist you achieve your aim.
The FILTER operator selects tuples from a relation based on some condition. It is equivalent to the SQL “WHERE” clause, “WHERE..AND” and “WHERE..OR..” clauses.
Syntax:
alias = FILTER alias BY expression;
Here, alias is the name of the relation.
Prerequisites:
1) Sample Data.
File Name: emp.csv
“7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
“7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
“7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
“7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
“7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
“7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
“7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
“7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
“7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
“7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
“7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
“7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
“7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
“7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10
2) Loading the data into a relation.
Execute the below in a single line to avoid each line execution.
data = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
““` (empno:int,
“““ename:chararray,
“““job:chararray,
“““mgr:int,
“““hiredate:chararray,
“““sal:double,
“““comm:double,
“““deptno:int);
Example-1:
result = FILTER data BY (deptno == 10);
Dump result;
Result:
“““(7839,KING,PRESIDENT,0,17/Nov/1981,5000.0,0.0,10)
“““(7782,CLARK,MANAGER,7839,06/Sep/1981,2450.0,0.0,10)
“““(7934,MILLER,CLERK,7782,23/Jan/1982,1300.0,0.0,10)
The above command is equal to:
SELECT * FROM emp WHERE deptno = 10;
Example-2:
Multiple Conditions/Clauses:
result = FILTER data BY (deptno == 10) AND (sal > 2000);
Dump result;
“`Result:
“““(7839, KING, PRESIDENT, 0, 17/Nov/1981, 5000.0, 0.0, 10)
“““(7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450.0, 0.0, 10)
“`The above command is equal to:
“`SELECT * FROM emp WHERE deptno = 10 AND salary > 2000;
Example-3:
Multiple Conditions/Clauses:
result = FILTER data BY (deptno == 10) OR (deptno == 40);
Dump result;
Result:
“““(7839, KING, PRESIDENT, 0, 17/Nov/1981, 5000.0, 0.0, 10)
“““(7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450.0, 0.0, 10)
“““(7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300.0, 0.0, 10)
“`The above command is equal to:
“`SELECT * FROM emp WHERE (deptno = 10 OR deptno=40);
Hope you find this article helpful.
Please do follow for more interesting updates.
One comment