Elastic MapReduce and Data in S3

by Oliver Hookins · Mar. 06, 14

Fortunately, I don’t have to do much data analysis, but when I do there are two options: either the data is local to our own datacenter and I can use our own Hadoop cluster, or it is external and I can use Elastic MapReduce. You generally don’t run an Elastic MapReduce cluster all the time, so each time you create a cluster you still need to get the data into it somehow. Usually the easiest way is to use one of your existing instances outside the MapReduce system to transfer the data from wherever it lives to S3. If you are lucky, the data is already in S3.

Even better, Elastic MapReduce can run jobs against datasets located in S3 (rather than on HDFS, as is usually the case). I believe this used to be a customisation AWS had applied to Hadoop, but it has been in mainline for some time now. It is really quite simple: instead of supplying an absolute or relative path to your HDFS datastore, you provide an S3-style URI to the data, such as: s3://my-bucket-name/mydata/
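As a rough sketch of what that can look like, the snippet below builds a Hadoop Streaming invocation whose input is an S3 URI. Only the input URI comes from the text above; the streaming jar path, the output location, and the choice of /bin/cat and /usr/bin/wc as mapper and reducer are placeholders assumed for illustration, not details of my actual job. By default it only prints the command, so it can be checked on a machine without Hadoop.

```shell
# Sketch: use an S3 URI as the job input instead of an HDFS path.
# Everything except INPUT_URI is a hypothetical placeholder.
INPUT_URI="s3://my-bucket-name/mydata/"        # S3-style URI from the text
OUTPUT_URI="hdfs:///output/"                   # assumed output location
STREAMING_JAR="/path/to/hadoop-streaming.jar"  # assumed jar location

# Print the command; execute it as well only when RUN=1 (needs a cluster).
run() {
  echo "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

submit_job() {
  run hadoop jar "$STREAMING_JAR" \
    -input "$INPUT_URI" \
    -output "$OUTPUT_URI" \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
}

submit_job
```

Setting RUN=1 in the environment before running the script on the master node would submit the job for real rather than just printing it.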

The “magic” here is not that the job now runs against S3 directly; rather, a job is created ahead of your main workflow to copy the data over from S3 to HDFS. Unfortunately, it’s a bit slow. In the past it has also had showstopper bugs which prevented it from working for me at all, but in a lot of cases I just didn’t care enough about the slowness and used it anyway. Today’s job had significantly more data, so I decided to copy the data over by hand. I knew that would be faster, but I did not expect as much of a difference as this:

[Image: Screen Shot 2014-02-28 at 5.56.15 PM]

The first part of the graph shows the built-in copy operation that ran as part of the job I had started; the point where the curve steepens significantly is where I stopped the original job and started the S3DistCp command instead. Its usage is relatively simple:

hadoop fs -mkdir hdfs:///data/
hadoop jar lib/emr-s3distcp-1.0.jar --src s3n://my-bucket-name/path/to/logs/ --dest hdfs:///data/

The S3DistCp jar file is already present on the master node once it has been bootstrapped, so you can run the copy interactively, or as a step on a cluster that you run in an automated fashion. I thoroughly recommend using it, as it will cut down the total time of your job significantly!
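If you want to script the copy rather than type it, something along these lines might do. It is a minimal sketch that wraps the two commands above and adds a directory listing as a quick sanity check; the run helper, the dry-run default, and the final listing are my own additions for illustration, and the jar path and URIs are simply the example values used above.

```shell
# Sketch: copy from S3 to HDFS with S3DistCp before starting the job.
# Paths and URIs are the example values from above; adjust for your cluster.
SRC="s3n://my-bucket-name/path/to/logs/"
DEST="hdfs:///data/"
S3DISTCP_JAR="lib/emr-s3distcp-1.0.jar"

# Print each command; execute it too only when RUN=1 (on the master node).
run() {
  echo "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

copy_to_hdfs() {
  # Chain with && so a failed step stops the sequence early.
  run hadoop fs -mkdir "$DEST" &&
  run hadoop jar "$S3DISTCP_JAR" --src "$SRC" --dest "$DEST" &&
  run hadoop fs -ls "$DEST"   # list what arrived as a quick sanity check
}

copy_to_hdfs
```

Without RUN=1 the script only prints the three commands, which makes it easy to review what would happen before running it on the master node.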



Published at DZone with permission of Oliver Hookins, DZone MVB. See the original article here.
