Getting to Know Amazon Elastic MapReduce
Amazon Elastic MapReduce (EMR) is a service in the AWS portfolio for data processing and analytics on large volumes of data. It is based on Hadoop (version 0.20.205 at the time of writing) and relies on other AWS services such as EC2 and S3.
The data processing applications can be implemented using various technologies such as Hive, Pig, Java (Custom Jar) and Streaming (e.g. Python or Ruby). This post demonstrates how to use Hive on Amazon Elastic MapReduce: the sample application calculates the average price of Apple stock for every year from 1984 to 2012. At the time of writing, the Hive version is 0.7.1. (Side note: as will be shown, AAPL started at an average price of around 25 USD in 1984, dropped to 18 USD in 1997, and is now around 500 (496.32138, to be more precise). Quite some numbers for a company that has been in Infinite Loop for decades…)
How to create Elastic MapReduce Jobs?
There are three steps to manage an EMR jobflow:
1./ Upload the script (i.e. the Hive .q file) and the data to be processed to S3. If you are unfamiliar with AWS, it is worth getting to know its structure and how to use it first.
The test data used in this post was downloaded from the Yahoo! Finance website (historical data for AAPL stock). Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices and scroll down to the Download to Spreadsheet link. This produces a CSV file (~6,950 lines) with the following columns: Date,Open,High,Low,Close,Volume,Adj Close. Remove the header (the first line) so that only the data rows remain in the CSV file.
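Removing the header by hand works, but the step can also be scripted. The following is a minimal sketch, assuming the downloaded file starts with exactly the header line listed above; the function name `strip_header` and the sample rows are made up for illustration and are not real quotes.

```python
# Header line as described in the article; anything else is left untouched.
EXPECTED_HEADER = "Date,Open,High,Low,Close,Volume,Adj Close"


def strip_header(csv_text):
    """Return csv_text without its first line if that line is the column header."""
    lines = csv_text.splitlines(keepends=True)
    if lines and lines[0].strip() == EXPECTED_HEADER:
        lines = lines[1:]
    return "".join(lines)


# Made-up sample rows (not real AAPL quotes), in the column order above.
sample = (
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2012-01-04,10.0,12.0,9.0,11.0,1000,11.0\n"
    "2012-01-03,10.0,12.0,9.0,13.0,1000,13.0\n"
)
cleaned = strip_header(sample)
print(cleaned, end="")
```

Because the function only drops the first line when it matches the header, running it twice on the same file is harmless.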
Steps to upload the input files:
a./ go to the AWS S3 console and create the “stockprice” bucket
b./ create folders under the stockprice bucket: apple/input, apple/output and hive-scripts
c./ upload the apple.q Hive script into the s3://stockprice/hive-scripts folder
d./ upload the CSV input file containing AAPL stock prices into the s3://stockprice/apple/input folder
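The upload steps above can also be scripted instead of using the console. The sketch below is an assumption-laden illustration rather than the article's method: it uses boto3, a present-day AWS SDK for Python that the article does not mention, and the local file name `table.csv` is hypothetical. S3 bucket names are global, so the bucket name "stockprice" will most likely have to be replaced with your own.

```python
import os

# Bucket name from the article; bucket names are global, so yours will likely differ.
BUCKET = "stockprice"


def plan_uploads(script_path, csv_path):
    """Map local files to the S3 keys used in this walkthrough."""
    return {
        script_path: "hive-scripts/" + os.path.basename(script_path),
        csv_path: "apple/input/" + os.path.basename(csv_path),
    }


def upload_all(plan, bucket=BUCKET):
    """Upload each planned file. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the planning step runs without boto3

    s3 = boto3.client("s3")
    for local_path, key in plan.items():
        s3.upload_file(local_path, bucket, key)


# "table.csv" is a hypothetical name for the downloaded CSV file.
plan = plan_uploads("apple.q", "table.csv")
for local_path, key in plan.items():
    print(f"{local_path} -> s3://{BUCKET}/{key}")
```

Only the planning step runs here; calling `upload_all(plan)` performs the actual transfer once the bucket exists and credentials are set up.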
2./ Create an Elastic MapReduce jobflow:
a./ navigate to https://console.aws.amazon.com/elasticmapreduce/home
b./ select “Create New Job Flow”
c./ configure the job parameters
d./ configure the EC2 instances
e./ define the EC2 key pair
f./ optionally, configure debugging by defining an S3 log path and selecting “Enable Debugging”. I highly recommend doing so while you are in the development phase
g./ do not set any bootstrap actions
h./ review the configuration before you hit the run button
i./ create the job flow
j./ follow the job flow status as it moves from STARTING to RUNNING to SHUTDOWN.
Should any issues occur, you can check stderr, stdout and syslog from the “Debug” menu.
3./ Check the result:
After a few minutes of number crunching, the output will be generated in the s3://stockprice/apple/output folder (e.g. in a file named 000000). The file is plain text with two columns, the year and the average stock price, separated by SOH (start of heading, ASCII 001).
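Since the SOH separator is a non-printing character, the output file is awkward to read in a text editor. The following is a minimal sketch of parsing it once the file has been downloaded, assuming every line holds a year and an average price; the function name `parse_hive_output` and the sample values are made up for illustration.

```python
# SOH (start of heading, ASCII 001) is the field separator in the output file.
SOH = "\x01"


def parse_hive_output(text):
    """Parse lines of the form <year>SOH<avg_price> into (int, float) tuples."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        year, avg_price = line.split(SOH)
        rows.append((int(year), float(avg_price)))
    return rows


# Made-up sample content, not actual job output.
sample = "1984" + SOH + "25.5\n" + "1985" + SOH + "20.25\n"
for year, avg_price in parse_hive_output(sample):
    print(year, avg_price)
```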
The Hive code to process the data (apple.q) looks like this:
-- Raw daily quotes, read directly from the CSV file in S3
CREATE EXTERNAL TABLE stockprice (
  yyyymmdd STRING,
  open_price FLOAT,
  high_price FLOAT,
  low_price FLOAT,
  close_price FLOAT,
  stock_volume INT,
  adjclose_price FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://stockprice/apple/input/';

-- Intermediate table holding only the year and the closing price
CREATE TABLE tmp_stockprice (
  year INT,
  close_price FLOAT
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE tmp_stockprice
SELECT YEAR(sp.yyyymmdd), sp.close_price
FROM stockprice sp;

-- Average closing price per year
CREATE TABLE avg_yearly_stockprice (
  year INT,
  avg_price FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

INSERT OVERWRITE TABLE avg_yearly_stockprice
SELECT tmp_sp.year, avg(tmp_sp.close_price)
FROM tmp_stockprice tmp_sp
GROUP BY tmp_sp.year;

-- Write the result to the S3 output folder
INSERT OVERWRITE DIRECTORY 's3://stockprice/apple/output/'
SELECT * FROM avg_yearly_stockprice;
Alternatively, you can define a LOCATION for avg_yearly_stockprice (making it an external table), in the same way as is done for the stockprice table, instead of using INSERT OVERWRITE DIRECTORY.
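To sanity-check the averages the Hive script produces, the same aggregation can be run locally on the CSV file. The following is a minimal sketch, assuming the header has already been removed and that the Date column starts with a four-digit year (e.g. 2012-01-04); the function name `yearly_average_close` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def yearly_average_close(csv_text):
    """Average the Close column per year, mirroring the GROUP BY in apple.q."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        # Columns: Date,Open,High,Low,Close,Volume,Adj Close (header removed)
        year = int(row[0][:4])  # assumes the date starts with the year
        totals[year] += float(row[4])
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


# Made-up sample rows (not real AAPL quotes).
sample = (
    "2012-01-04,10.0,12.0,9.0,11.0,1000,11.0\n"
    "2012-01-03,10.0,12.0,9.0,13.0,1000,13.0\n"
    "2011-12-30,5.0,6.0,4.0,5.0,1000,5.0\n"
)
print(yearly_average_close(sample))
```

Hive's FLOAT is single precision while Python floats are double precision, so the local figures may differ from the job output in the last few digits.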
Published at DZone with permission of Istvan Szegedi, DZone MVB.