SQL Queries are used to get data from tables by using SELECT statements with suitable filters and sorting. In this post, I’ll attempt to illustrate how to build a query in Apache Pig in a manner comparable to how SQL queries extract data.
Take a look at the sample below, which summarizes salaries by department, in SQL.
SELECT Dname, SUM(Sal) TotalSal FROM emp e
JOIN dept d ON e.deptno = d.deptno
WHERE Job IN(‘Manager’, ‘Analyst’)
GROUP BY Dname
HAVING SUM(Sal)> 2500
ORDER BY TotalSal DESC
Now, let’s do the same exercise using Apache Pig.
Since Apache Pig has a multi-query method that minimizes code length, the preceding code should be written in the form shown below.
Step1: Load the data into relations.
Please keep in mind that the script has been formatted to make it easier to read. To avoid each line execution, make sure it’s executed in a single line.
dept = LOAD ‘Desktop/Docs/dept.csv’ USING PigStorage(‘,’) as
““`(deptno:int,
““`dname:chararray,
““`loc:chararray);
emp = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) as
““`(empno:int,
““`ename:chararray,
““`job:chararray,
““`mgr:int,
““`hiredate:chararray,
““`sal:double,
““`comm:double,
““`deptno:int);
Step2: Filter the employee data to only include the information that is necessary.
empfilter = FILTER emp BY (job == ‘ANALYST’) OR (job== ‘MANAGER’);
Step3: Now JOIN the ’employee’ and ‘department’ relations.
joindata= JOIN empfilter BY deptno, dept BY deptno;
Step4: Group the data.
grouped = GROUP joindata BY dname;
Step5: The script below will retrieve more columns than are required, as well as summary salaries by department; however, it is just intended to demonstrate how to retrieve extra columns.
sumdata= FOREACH grouped GENERATE (
““““joindata.empno,
““““joindata.ename,
““““joindata.job,
““““joindata.mgr,
““““joindata.hiredate,
““““joindata.sal,
““““joindata.comm,
““““joindata.dname),
““““SUM(joindata.sal) as sumtotal;
Step6: the WHERE clause
chkdata= FILTER sumdata BY sumtotal > 2500.0;
Step7: Sorting
sorted = ORDER chkdata BY sumtotal DESC;
Step8: finally, the SELECT statement
DUMP sorted;
Please note that by combining the GROUP, FILTER, and FOREACH conditions, the number of steps can be decreased. We’ll go over that in more detail in the next article.
Result:
I hope you found this post to be informative. Please continue to follow us for more interesting information.
One comment