Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Fastest Migration from MySQL to Redshift

DZone's Guide to

Fastest Migration from MySQL to Redshift

Your guide to the fastest possible migration path to Redshift (in less time than it takes to ignore an index!)

· Database Zone
Free Resource

Check out the IT Market Clock report for recommendations on how to consolidate and replace legacy databases. Brought to you in partnership with MariaDB.

Redshift, as you may already know, is quickly gaining broad acceptance, especially among consumers of free software like MySQL and PostgreSQL, for its “pay as you go” pricing model. However, the same pricing model can still make it a very expensive one. Not all queries need to be done against the Redshift instance itself, as it turns out, and not running these queries directly on your Redshift instance can save you a lot of money. Nonetheless, many DBAs familiar with MySQL will soon need a quick and effective migration path to Redshift.

Migration looks something like this:

  1. Extract (dump) the data from MySQL;
  2. Load the data into Redshift.
  3. Enjoy!
  4. Rinse, lather, repeat (especially with a very large dataset – you won’t be able to move all of the data in one go)!

But actually, there’s more to it:

  1. After I’ve extracted the data dump from MySQL, where do I store it before importing it to Redshift?
  2. What about schema- and type- compatibility issues? The two database types share some, but not all, of the same data types. What’s more, while Redshift can be thought of as a petabyte-scale version of Postgres (albeit, with significant changes), MySQL has its own MPP database, called ShardQuery, which works quite differently in fundamental ways.
  3. How can I automate the process? More than likely, you won’t be doing all the database migration in one go.

MySQL-question-redshift

So here’s the thing. Thinking traditionally, you may need to actually do something like this:

  1. Create a Redshift cluster.
  2. Export MySQL data, using an export query, and split them into multiple files.
  3. Upload the load files to Amazon S3.
  4. Run a COPY command (possibly in multiple iterations) to load the table.
  5. Verify that the data was loaded correctly.

As you might imagine, that’s A LOT of COPY commands and error-prone uploading when you’re talking billions of rows of data spread across multiple MySQL tables! Of course, you’d also better be sure that your export query is set up just right – trying to figure out exactly in which file something went wrong is the reason many DBAs change careers.

There’s an easier way.

Did you know you can address all of these issues by putting Treasure Data in the middle as a data lake and Redshift Ingestion funnel? Now you do.

  1. Treasure Data can store your database dumps from MySQL: With Treasure Data’s data connectors, you can easily dump – and store – your MySQL snapshots. We ingest up to 60 billion rows per day! What’s more, the connectors support SSL, so you know your data is secure.
  2. Treasure Data is schema-on-read: You can easily dump any strongly typed database snapshots without worrying about compatibility.
  3. Treasure Data can schedule jobs: You can automate – and schedule – pulling the data from MySQL as well as pushing the data into Redshift.

Migration: MySQL -> Treasure Data -> Redshift

MySQL_TD_RS
We’ll assume the following:

  1. You have a MySQL database running on Amazon RDS. Your database should have at least one port accessible from Treasure Data API servers. You can configure your security group to do this.
  2. You have a Redshift instance running, also accessible from Treasure Data.
  3. You have installed and configured Treasure Data Toolbelt on your system.

Let’s get started. NOTE: All of the command line stuff can be automated in a shell script, bash script or scripting language of your choice using system libraries. We’ll show you also how to schedule below. Also, in keeping these instructions as general as possible, we’ve either provided example values where something might be unclear or have started the variable names with your_. Please send us your suggestions in the comments section.

  1. Create seed.yml. We’ll include SSL settings. This should contain information on the host and database you’ll be pulling your data from:
    config:
      in:
        type:  mysql
        host:   your_mysql_host_name
        port:  3306
        user:  your_test_user
        password:  your_test_password
        options:
          requreSSL: true
          useSSL: true
          verifyServerCertificate: false
        database:  your_test_database
        table:  your_test_table
        select: “*”    #this selects everything
      out: 
        mode:  replace

 

  • ‘Guess’ the data. $ td connector:guess seed.yml -o load.yml
  • Preview the data before loading it into Treasure Data. (Click the image to enlarge.) $ td connector:preview load.ymltd-bukload_preview
  •  

  • Create the database and tables on Treasure Data. This is where you’ll dump your data from MySQL before you load it into Redshift. $ td db:create your_treasure_data_database && td table:create your_treasure_data_database your_treasure_data_table
  • Load the data from MySQL into your Treasure Data table. Note that you’ll need to specify a –time-column if there isn’t one. $ td connector:issue td-bulkload.yml –database your_treasure_data_database –table your_treasure_data_table –time-column your_timestamp_column
  • Now is the step where you query your data into the format you want to export to Redshift. Go to console.treasuredata.com and, for example, type out the query to get your first 20 items sorted by a value.
    SELECT
       *
    FROM
      your_treasure_data_table
    ORDER BY
      your_value ASC LIMIT 20
    •  Set it up to send the query results to Redshift. Host: my-amazon_host-rs.c5pqzaoivep8.us-west-2.redshift.amazonaws.com #Example Username: your_amazon_redshift_username Password: your_amazon_redshift_password Database: your_amazon_redshift_database Table: your_amazon_redshift_table Mode: append Method: insert console_query

     

  • Run the query.
  • View and query the results on Redshift. You can connect to your Redshift instance with SQL Workbench (or similar) and query your Redshift data.
  •  You can also check your transaction log on Redshift. You will save money by needing to run fewer queries and transformations directly on Redshift.

     

  • You can schedule an import using the td connector:create command. You’ll need the schedule, cron-style schedule, the database and table where their data will be stored, and the Data Connector configuration file, as follows:
    $ td connector:create \ daily_mysql_import \ “10 0 * * *” \ your_treasure_data_database \ your_treasure_data_table \ load.yml
  • Summing up

    Treasure Data provides a great solution to simplify migration from MySQL to Redshift. With our cloud data lake, you can unify your analytics infrastructure and you can publish your results, to various target systems – not just Redshift!


    Originally authored by John Hammink, Chief Evangelist for Treasure Data

    Interested in reducing database costs by moving from Oracle Enterprise to open source subscription?  Read the total cost of ownership (TCO) analysis. Brought to you in partnership with MariaDB.

    Topics:
    redshift ,mysql ,migration

    Published at DZone with permission of Kiyoto Tamura, DZone MVB. See the original article here.

    Opinions expressed by DZone contributors are their own.

    The best of DZone straight to your inbox.

    SEE AN EXAMPLE
    Please provide a valid email address.

    Thanks for subscribing!

    Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
    Subscribe

    {{ parent.title || parent.header.title}}

    {{ parent.tldr }}

    {{ parent.urlSource.name }}