Tuesday, August 20, 2013

Getting Started with Hadoop: Loading Data using HCatalog/Hive



In this tutorial we will get started with loading data into Hortonworks Hadoop. We will walk through step-by-step instructions for loading data using HCatalog, and later use Hive to do some simple manipulation of the loaded data with Hive SQL.
 

Refer to http://tahir-aziz.blogspot.com/2013/07/introduction-to-big-data-and-hadoop.html if you are new to Hadoop and need more basic information.

Loading Data using Hive


You can load sample data into Hadoop using the File Explorer or directly through Hive. Let’s use the Hive UI to create a table using a file as input. For this example we are going to use the sample file available on the Hadoop website (download this file and save it to your disk).


Follow the steps below to create a table from the file and load the data into the target table.

1.       Click on the “HCatalog” icon and then click “Create a new table from File”. (See Fig below).




2.       Fill in the required details: enter the table name and description, and choose the file you saved to your disk for upload.



3.       Click Next. On the “Choose Delimiter” step, choose the delimiter for your file. For this sample file it is TAB, but other options (e.g. comma, pipe) are available if your file uses a different delimiter.

4.       Click Next. On the “Define Columns” step, check the data types. By default every column is treated as a string; change each data type to one suitable for the column contents. For example, in our sample file stock_price_high, stock_price_low, etc. are good candidates to be handled as float, so change them.

5.       Once you are done, the process will create the table and load it with the file you selected in the steps above. (A rough Hive SQL equivalent of what the wizard does is sketched below.)
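
For reference, the HCatalog wizard does roughly what the following Hive SQL does. This is only a minimal sketch: the table name nyse_stocks_t matches the table used later in this post, but the exact column list, types, and the HDFS path of the uploaded file are assumptions based on the columns mentioned above, so adjust them to your file.

-- Sketch of the table the wizard creates (column names/types are assumptions)
CREATE TABLE nyse_stocks_t (
  exchange_name STRING,
  stock_symbol STRING,
  trade_date STRING,
  stock_price_open FLOAT,
  stock_price_high FLOAT,
  stock_price_low FLOAT,
  stock_price_close FLOAT,
  stock_volume BIGINT,
  stock_price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load the uploaded file into the table (the path here is a placeholder)
LOAD DATA INPATH '/user/hue/NYSE-sample.tsv' INTO TABLE nyse_stocks_t;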




So what is next? Let’s check the data contents using Hive SQL and verify that everything loaded properly and all the data types are correct.

Click on the Hive UI and you will see an interface where you can enter queries and execute them.
To check whether the table has been created and what its data types are, enter the following query:

describe nyse_stocks_t;

It will display the column names and their data types.
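
The output will look something like the listing below. The column names and types shown here are an assumption based on the columns mentioned earlier; yours will reflect the file you uploaded and the types you chose in the “Define Columns” step.

exchange_name           string
stock_symbol            string
trade_date              string
stock_price_open        float
stock_price_high        float
stock_price_low         float
stock_price_close       float
stock_volume            bigint
stock_price_adj_close   float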

Now let’s try a simple manipulation: say we want to calculate the average stock volume for IBM. Hive SQL is similar to standard SQL and supports almost all of the standard SQL aggregations. You can write the following Hive SQL to calculate the average:

select AVG(stock_volume) from nyse_stocks_t
where stock_symbol = 'IBM';

This will return the average stock volume for IBM.
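
As a small extension of the same idea (assuming the same table and columns as above), a GROUP BY computes the average volume for every symbol in a single query:

-- Average volume per stock symbol (sketch, same assumed schema)
select stock_symbol, AVG(stock_volume) as avg_volume
from nyse_stocks_t
group by stock_symbol;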

Hope you learned the basics.

Thanks

Tahir Aziz
