In this tutorial we would try to explore some functions for
data processing and learn them by example. We would be using PIG script to find
out the Maximum runs scored by a player from a baseball stat data. You can
download the sample file from http://seanlahman.com/files/database/lahman591-csv.zip.
It’s a collection of files having information about baseball
stats while if you unzip this you would have number of files we will be using “Batting.csv” in this tutorial.
Prerequisite: To understand this tutorial you should have a
basic knowledge about PIG and loading file into H-Catalog. Please visit below
URL for more details
Steps:
1.
Logon to Hadoop console and Click on “PIG” Icon
and Click “New Script”.
2.
Write down the code below in the editing area
a) batting = load
'Batting.csv' using PigStorage(',');
b) runs = FOREACH
batting GENERATE $0 as playerID, $1 as year , $8 as runs;
c) grp_data =
GROUP runs by (year);
d) max_runs =
FOREACH grp_data GENERATE group as grp , MAX(runs.runs) as max_runs;
e) join_max_run =
JOIN max_runs by ($0, max_runs), runs by (year,runs);
f) join_data =
FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
g) dump join_data;
3.
Click on “Save”.
Now let’s try to understand code line by line and keep in
mind what we want to do is to find out the Max score done by a player in any
year.
batting = load 'Batting.csv'
using PigStorage(',');
This is pretty simple we just want to load one file using
PigStorage() function it will load the file structure into “batting” variable (
an array actually).
runs = FOREACH batting GENERATE
$0 as playerID, $1 as year , $8 as runs;
Now here we just want to extract those columns which are
useful for us into a separate array you can see that data is accessed based on
the $index as first index in Batting.csv contain PlayerId ( name) and 8th
index contains the score done by player.
grp_data = GROUP runs by
(year);
GROUP function only groups data into chunks based on the
column we specify in our case the data groups will be created based on YEAR
which means all records that belongs to one year will be stored together in ONE
ROW. If you want to understand it better execute below in PIG editir area
DESCRIBE grp_data; //
Results of this statement is below
grp_data: {group:
bytearray,runs: {(playerID: bytearray,year: bytearray,runs: bytearray)}}
This is how GROUP looks like and if you look at that 'group'
is the name of first column by default n will contain the "year"
value as defined in GROUP BY clause.
The second column is an Object which means you will have
records which falls in this group. e.g in this year.
max_runs = FOREACH grp_data
GENERATE group as grp , MAX(runs.runs) as max_runs;
Now for each year it will calculate the MAX runs scored.
join_max_run = JOIN max_runs by
($0, max_runs), runs by (year,runs);
JOIN works in similar fashion as it does in SQL it is
joining the $0 (Year) and max runs ( line d) with Year and Runs with “runs” ( line b in
code) array.
In SQL terms the join is as follow
Max_runs inner join
Runs on Max_runs.Year
= runs.year
And Max_runs.max_runs = runs.runs;
join_data = FOREACH join_max_run
GENERATE $0 as year, $2 as playerID, $1 as runs;
This last statement just generate a new final dataset by
extracting Year , Player Name and MAX runs scored from the joined data set.
The final output of this program is as follows
(1871,barnero01,66.0)
(1872,eggleda01,94.0)
(1873,barnero01,125.0)
(1874,mcveyca01,91.0)
(1875,barnero01,115.0)
(1876,barnero01,126.0)
…
…
Thanks