Sunday, September 1, 2013

Data Processing with Pig : Understanding PIG functions



In this tutorial we would try to explore some functions for data processing and learn them by example. We would be using PIG script to find out the Maximum runs scored by a player from a baseball stat data. You can download the sample file from http://seanlahman.com/files/database/lahman591-csv.zip.

It’s a collection of files having information about baseball stats while if you unzip this you would have number of files we will be using “Batting.csv”  in this tutorial.
Prerequisite: To understand this tutorial you should have a basic knowledge about PIG and loading file into H-Catalog. Please visit below URL for more details

Steps:

1.       Logon to Hadoop console and Click on “PIG” Icon and Click “New Script”.
2.       Write down the code below in the editing area
a)      batting = load 'Batting.csv' using PigStorage(','); 
b)      runs = FOREACH batting GENERATE $0 as playerID, $1 as year , $8 as runs; 
c)       grp_data = GROUP runs by (year); 
d)      max_runs = FOREACH grp_data GENERATE group as grp , MAX(runs.runs) as max_runs;
e)      join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
f)       join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
g)      dump join_data;
3.       Click on “Save”.
Now let’s try to understand code line by line and keep in mind what we want to do is to find out the Max score done by a player in any year.
batting = load 'Batting.csv' using PigStorage(',');   
This is pretty simple we just want to load one file using PigStorage() function it will load the file structure into “batting” variable ( an array actually).
runs = FOREACH batting GENERATE $0 as playerID, $1 as year , $8 as runs; 
Now here we just want to extract those columns which are useful for us into a separate array you can see that data is accessed based on the $index as first index in Batting.csv contain PlayerId ( name) and 8th index contains the score done by player.
grp_data = GROUP runs by (year); 
GROUP function only groups data into chunks based on the column we specify in our case the data groups will be created based on YEAR which means all records that belongs to one year will be stored together in ONE ROW. If you want to understand it better execute below in PIG editir area
DESCRIBE grp_data;   // Results of this statement is below
grp_data: {group: bytearray,runs: {(playerID: bytearray,year: bytearray,runs: bytearray)}}
This is how GROUP looks like and if you look at that 'group' is the name of first column by default n will contain the "year" value as defined in GROUP BY clause.
The second column is an Object which means you will have records which falls in this group. e.g in this year.
max_runs = FOREACH grp_data GENERATE group as grp , MAX(runs.runs) as max_runs;
Now for each year it will calculate the MAX runs scored.
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
JOIN works in similar fashion as it does in SQL it is joining the $0 (Year) and max runs ( line d)  with Year and Runs with “runs” ( line b in code) array. 

In SQL terms the join is as follow
Max_runs inner join
Runs on  Max_runs.Year = runs.year
And Max_runs.max_runs = runs.runs;

join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
This last statement just generate a new final dataset by extracting Year , Player Name and MAX runs scored from the joined data set.
The final output of this program is as follows
(1871,barnero01,66.0)
(1872,eggleda01,94.0)
(1873,barnero01,125.0)
(1874,mcveyca01,91.0)
(1875,barnero01,115.0)
(1876,barnero01,126.0)

Thanks