Tools for Moving from an SQL Database to Hadoop
Cloudera knows Hadoop - that's for sure. Doug Cutting, who created Hadoop, now works for Cloudera. Sqoop is a FOSS tool that was contributed to the Apache Hadoop project. It takes your database table and automatically generates the necessary classes that represent the rows from the table. It finds the most efficient way to divide portions of the table for the MapReduce framework. Once this is done, you can specify directions for importing data into the Hadoop Distributed File System (HDFS). Sqoop has performance enhancements for loading and exporging data from MySQL and Oracle. It supports both SequenceFile and text-based targets. However, Sqoop doesn't detect unsigned columns and it will try to fit unsigned integers into a Java int because of MySQL's deviation from the JDBC standard. Cloudera is planning to close these issues.
Cascading is an open source API for defining and executing complex data processing workflows on a Hadoop cluster without having to 'think' in MapReduce. It also allows developers to efficiently schedule complex distributed processes based on dependencies and other metadata. cascading.jdbc (you might've guessed) is a JDBC adapter for Cascading. It allows you to read and write data to and from an RDBMS using JDBC drivers that are attached to a Cascading data processing flow. The adapter supports INSERT and UPDATE during sinking and can use custom SELECTs during sourcing. Essentially, you can read what's in the database and run custom migration logic all in one job. Performance, however, is hampered because cascading.jdbc uses LIMIT and OFFSET in its SQL queries.
Nathan Marz liked the approach of cascading.jdbc and wanted to make his own tool for migrating large relational databases into Hadoop clusters. His tool tries to relieve the performance issues of cascading.jdbc with a redesigned tap called DBMigrateTap to read data from a database in Cascading flow. The tap reads data for each task with range queries over the table's primary key column. It will emit tuples containing one field for each of the column reads, where the field names are the column names. cascading-dbmigrate works on SQL tables that have a primary key column that is either an int or a long. If this is your situation, you should try using cascading-dbmigrate to move your data since it's a relatively young open source project that needs feedback. Marz's company, BackType, is using this solution now to migrate their data from SQL databases to HDFS.