Loading and Storing Data In Apache Pig

Apache Pig is a tool/platform for analyzing huge volumes of data by expressing the analysis as data flows. Pig is commonly used with Hadoop; we can use Apache Pig to perform all kinds of data manipulation in Hadoop. This article explains how to load data from HDFS and the local file system, and how to export the analyzed/processed data back to HDFS or the local file system.

The following commands load the “emp” and “dept” data into Pig:

emp = LOAD 'Desktop/Docs/emp.csv'
            USING PigStorage(',')
            as (empno:int,
                  ename:chararray,
                  job:chararray,
                  mgr:int,
                  hiredate:chararray,
                  sal:double,
                  comm:double,
                  deptno:int);

dept = LOAD 'Desktop/Docs/dept.csv'
             USING PigStorage(',')
             as (deptno:int,
                   dname:chararray,
                   loc:chararray);

If you’ve observed closely, the data is being read from the local file system. That is because Pig was started in local mode with “pig -x local”. If you connect in MapReduce mode (“pig -x mapreduce”), the file path should point to HDFS, and the code changes as shown below.

emp = LOAD '/user/cloudera/emp.csv'
            USING PigStorage(',')
            as (empno:int,
                  ename:chararray,
                  job:chararray,
                  mgr:int,
                  hiredate:chararray,
                  sal:double,
                  comm:double,
                  deptno:int);

dept = LOAD '/user/cloudera/dept.csv'
             USING PigStorage(',')
             as (deptno:int,
                   dname:chararray,
                   loc:chararray);

Execution Output:
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local1717475295_0001 emp MAP_ONLY file:/tmp/temp-1777562545/tmp716134895,

Input(s):
Successfully read records from: "file:///home/cloudera/Desktop/Docs/emp.csv"

Output(s):
Successfully stored records in: "file:/tmp/temp-1777562545/tmp716134895"

To avoid entering the statement line by line, you can run the whole command on a single line.

Pig comes with a built-in load function called PigStorage. Whenever we wish to import data from a file system into Pig, we can use PigStorage. The delimiter passed to PigStorage tells it how to separate the fields of each record while loading the data.
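
For example, if the same employee data were kept in a tab-delimited file instead of a CSV, only the delimiter passed to PigStorage would change. A minimal sketch, assuming a hypothetical local file named emp_tab.txt with the same columns:

-- Hypothetical tab-delimited employee file; PigStorage('\t') splits fields on tabs.
emp_tab = LOAD 'Desktop/Docs/emp_tab.txt'
          USING PigStorage('\t')
          as (empno:int,
                ename:chararray,
                job:chararray,
                mgr:int,
                hiredate:chararray,
                sal:double,
                comm:double,
                deptno:int);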

The “emp” and “dept” names into which the data is loaded are relations (aliases), not tables, since Pig does not have a concept of tables. As a result of the command, the loaded data is saved in a temporary folder.
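
To see what an alias actually holds, you can DESCRIBE it to print its schema or DUMP it to print its records on screen. Note that LOAD by itself only defines the relation; a job like the one in the output above runs when the data is actually needed, for example by a DUMP or STORE. A quick sketch:

DESCRIBE emp;   -- prints the schema of the emp relation
DUMP emp;       -- runs the job and prints the employee records on screen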

Apache Pig can read any file on your HDFS storage and save the results of the analysis there. In MapReduce mode, the temporary folder where loaded data lands is also part of HDFS. Hence it helps to understand the HDFS file system well, so that you can locate the data you wish to analyze with Pig.

Summary: With the above commands, we just loaded the data into Pig. There are no tables to create or drop; once you quit the Pig shell, the data needs to be loaded again.

Storing the analyzed data:
Writing output to the file system is referred to as storing. The STORE keyword is followed by the name of the relation whose data is to be saved, along with the storage location.

STORE empanalyzeddata INTO '/path' USING PigStorage();
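
As a fuller sketch (the relation name, the filter, and the output path below are made up for illustration), you could keep only the employees of department 10 and write them out as comma-delimited records on HDFS:

-- Illustrative analysis step: keep only department 10 employees.
dept10_emp = FILTER emp BY deptno == 10;

-- Write the result as comma-separated fields; the target directory must not already exist.
STORE dept10_emp INTO '/user/cloudera/output/dept10_emp' USING PigStorage(',');

With no argument, as in the STORE statement shown earlier, PigStorage() uses its default field delimiter, which is a tab.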

Hope you find this article helpful.

Please do click on the follow button to get the latest updates.
