Data sampling is the best practice to understand the data patterns and trends of large datasets by looking at the smaller portion of the data. In RDBMS systems, the LIMIT operator is used to do this, and the same is true in Apache Pig. Let’s look at how to get N tuples (rows in SQL) from a relation (table in SQL).
Prerequisites:
1) Sample Data.
File Name: emp.csv
“7839, KING, PRESIDENT, 0,17/Nov/1981, 5000, 0, 10
“7698, BLAKE, MANAGER, 7839, 01/May/1981, 2850, 0, 30
“7782, CLARK, MANAGER, 7839, 06/Sep/1981, 2450, 0, 10
“7566, JONES, MANAGER, 7839, 04/Feb/1981, 2975, 0, 20
“7788, SCOTT, ANALYST, 7566, 13/Jul/87, 3000, 0, 20
“7902, FORD, ANALYST, 7566, 03/Dec/1981, 3000, 0, 20
“7369, SMITH, CLERK, 7902, 17/Dec/1980, 800, 0, 20
“7499, ALLEN, SALESMAN, 7698, 20/Feb/1981, 1600, 300, 30
“7521, WARD, SALESMAN, 7698, 22/Feb/1981, 1250, 500, 30
“7654, MARTIN, SALESMAN, 7698, 28/Sep/1981, 1250, 1400, 30
“7844, TURNER, SALESMAN, 7698, 09/Aug/1981, 1500, 0, 30
“7876, ADAMS, CLERK, 7788, 13/Jul/87, 1100, 0, 20
“7900, JAMES, CLERK, 7698, 03/Dec/1981, 950, 0, 30
“7934, MILLER, CLERK, 7782, 23/Jan/1982, 1300, 0, 10
2) Loading the data into a relation.
“Execute the below in a single line to avoid each line execution.
data = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) AS
“““(empno:int,
“““ename:chararray,
“““job:chararray,
“““mgr:int,
“““hiredate:chararray,
“““sal:double,
“““comm:double,
“““deptno:int);
Example:
data_limit = LIMIT data 4;
Dump data_limit;
Result:
“““(7566,JONES,MANAGER,7839,04/Feb/1981,2975.0,0.0,20)
“““(7698,BLAKE,MANAGER,7839,01/May/1981,2850.0,0.0,30)
“““(7782,CLARK,MANAGER,7839,06/Sep/1981,2450.0,0.0,10)
“““(7839,KING,PRESIDENT,0,17/Nov/1981,5000.0,0.0,10)
The above command is equal to:
“`SELECT * FROM emp LIMIT 4;
“`SELECT TOP 4 * FROM emp;
Hope you find this article helpful.
Please follow us for more interesting updates.
One comment