Query: SQL vs Apache Pig

SQL Queries are used to get data from tables by using SELECT statements with suitable filters and sorting. In this post, I’ll attempt to illustrate how to build a query in Apache Pig in a manner comparable to how SQL queries extract data.

Take a look at the sample below, which summarizes salaries by department, in SQL.

SELECT Dname, SUM(Sal) TotalSal FROM emp e
JOIN dept d ON e.deptno = d.deptno
WHERE Job IN(‘Manager’, ‘Analyst’)
GROUP BY Dname
HAVING SUM(Sal)> 2500
ORDER BY TotalSal DESC

ApachePig_SQLQueryEquivalentinPig2

Now, let’s do the same exercise using Apache Pig.

Since Apache Pig has a multi-query method that minimizes code length, the preceding code should be written in the form shown below.

Step1: Load the data into relations.
Please keep in mind that the script has been formatted to make it easier to read. To avoid each line execution, make sure it’s executed in a single line.

dept = LOAD ‘Desktop/Docs/dept.csv’ USING PigStorage(‘,’) as

““`(deptno:int,
““`dname:chararray,
““`loc:chararray);

emp = LOAD ‘Desktop/Docs/emp.csv’ USING PigStorage(‘,’) as
““`(empno:int,
““`ename:chararray,
““`job:chararray,
““`mgr:int,
““`hiredate:chararray,
““`sal:double,
““`comm:double,
““`deptno:int);

Step2: Filter the employee data to only include the information that is necessary.
empfilter = FILTER emp BY (job == ‘ANALYST’) OR (job== ‘MANAGER’);

Step3: Now JOIN the ’employee’ and ‘department’ relations.
joindata= JOIN empfilter BY deptno, dept BY deptno;

Step4: Group the data.
grouped = GROUP joindata BY dname;

Step5: The script below will retrieve more columns than are required, as well as summary salaries by department; however, it is just intended to demonstrate how to retrieve extra columns.
sumdata= FOREACH grouped GENERATE (
““““joindata.empno,
““““joindata.ename,
““““joindata.job,
““““joindata.mgr,
““““joindata.hiredate,
““““joindata.sal,
““““joindata.comm,
““““joindata.dname),
““““SUM(joindata.sal) as sumtotal;

Step6: the WHERE clause

chkdata= FILTER sumdata BY sumtotal > 2500.0;

Step7: Sorting
sorted = ORDER chkdata BY sumtotal DESC;

Step8: finally, the SELECT statement
DUMP sorted;

Please note that by combining the GROUP, FILTER, and FOREACH conditions, the number of steps can be decreased. We’ll go over that in more detail in the next article.

ApachePig_SQLQueryEquivalentinPig0

Result:

ApachePig_SQLQueryEquivalentinPig


I hope you found this post to be informative. Please continue to follow us for more interesting information.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s