DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
  1. DZone
  2. Data Engineering
  3. Databases
  4. Tools for Moving from an SQL Database to Hadoop

Tools for Moving from an SQL Database to Hadoop

Mitch Pronschinske user avatar by
Mitch Pronschinske
·
Mar. 24, 10 · Interview
Like (0)
Save
Tweet
Share
32.80K Views

Join the DZone community and get the full member experience.

Join For Free
More organizations, small and large, are interested in moving their data from SQL relational databases to an Apache Hadoop back end.  Although SQL is always a solid option in many cases, Hadoop has advantages in data processing that are battle-tested by several major web properties.  These days it seems that Hadoop is the best game in town for "Big Data" handling.  For users that want to either experiment with Hadoop or put it into production-use, there are a couple of tools available for migrating SQL data to Hadoop.  They include: Cloudera's Sqoop, cascading.jdbc, and cascading-dbmigrate.

Sqoop
Cloudera knows Hadoop - that's for sure.  Doug Cutting, who created Hadoop, now works for Cloudera.  Sqoop is a FOSS tool that was contributed to the Apache Hadoop project.  It takes your database table and automatically generates the necessary classes that represent the rows from the table.  It finds the most efficient way to divide portions of the table for the MapReduce framework.  Once this is done, you can specify directions for importing data into the Hadoop Distributed File System (HDFS).  Sqoop has performance enhancements for loading and exporging data from MySQL and Oracle.  It supports both SequenceFile and text-based targets.  However, Sqoop doesn't detect unsigned columns and it will try to fit unsigned integers into a Java int because of MySQL's deviation from the JDBC standard.  Cloudera is planning to close these issues.

cascading.jdbc
Cascading is an open source API for defining and executing complex data processing workflows on a Hadoop cluster without having to 'think' in MapReduce.  It also allows developers to efficiently schedule complex distributed processes based on dependencies and other metadata.  cascading.jdbc (you might've guessed) is a JDBC adapter for Cascading.  It allows you to read and write data to and from an RDBMS using JDBC drivers that are attached to a Cascading data processing flow.  The adapter supports INSERT and UPDATE during sinking and can use custom SELECTs during sourcing.  Essentially, you can read what's in the database and run custom migration logic all in one job.  Performance, however, is hampered because cascading.jdbc uses LIMIT and OFFSET in its SQL queries.

cascading-dbmigrate
Nathan Marz liked the approach of cascading.jdbc and wanted to make his own tool for migrating large relational databases into Hadoop clusters.  His tool tries to relieve the performance issues of cascading.jdbc with a redesigned tap called DBMigrateTap to read data from a database in Cascading flow.  The tap reads data for each task with range queries over the table's primary key column.   It will emit tuples containing one field for each of the column reads, where the field names are the column names.  cascading-dbmigrate works on SQL tables that have a primary key column that is either an int or a long.  If this is your situation, you should try using cascading-dbmigrate to move your data since it's a relatively young open source project that needs feedback.  Marz's company, BackType, is using this solution now to migrate their data from SQL databases to HDFS.   
Database hadoop Relational database sql MySQL

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • 11 Observability Tools You Should Know
  • Steel Threads Are a Technique That Will Make You a Better Engineer
  • Spring Boot, Quarkus, or Micronaut?
  • Real-Time Analytics for IoT

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: