Sqoop: Import Data From MySQL to Hive

DZone 's Guide to

Sqoop: Import Data From MySQL to Hive

Getting data from a relational database into HDFS is usually the first step in querying it using Hive. Let's start Sqooping.

· Big Data Zone ·
Free Resource

sqoop drawing.png

Prerequisite: Hadoop Environment with Sqoop and Hive installed and working. To speed up the work, I am using Cloudera Quickstart VM  (requires 4GB of RAM), although you can also work with Hortonworks Data Platform (requires 8GB of RAM). Since my laptop has only 8GB of RAM, I prefer to work with the Cloudera VM image.

If you are working with Cloudera/HDP VM and its all fired up in VirtualBox, it becomes easier to work with many of the Hadoop ecosystem packages, as they come pre-installed: MySQL, Oozie, Hadoop, Hive, Zookeeper, Storm, Kafka, Spark, etc.

Create Table in MySQL

In Cloudera VM, open the command prompt and make sure MySQL is installed. For me, it’s:

shell> mysql --version
mysql  Ver 14.14 Distrib 5.1.66, for redhat-linux-gnu (x86_64) using readline 5.

You should always work in your own database, so create a database in MySQL using:

mysql> create database sqoop;


mysql> use sqoop;
mysql> create table customer(id varchar(3), name varchar(20), age varchar(3), salary integer(10));
Query OK, 0 rows affected (0.09 sec)

mysql> desc customer;
| Field  | Type        | Null | Key | Default | Extra |
| id     | varchar(3)  | YES  |     | NULL    |       |
| name   | varchar(20) | YES  |     | NULL    |       |
| age    | varchar(3)  | YES  |     | NULL    |       |
| salary | int(10)     | YES  |     | NULL    |       |
mysql> select * from customer;
| id   | name   | age  | salary |
| 1    | John   | 30   |  80000 |
| 2    | Kevin  | 33   |  84000 |
| 3    | Mark   | 28   |  90000 |
| 4    | Jenna  | 34   |  93000 |
| 5    | Robert | 32   | 100000 |
| 6    | Zoya   | 40   |  60000 |
| 7    | Sam    | 37   |  75000 |
| 8    | George | 31   |  67000 |
| 9    | Peter  | 23   |  70000 |
| 19   | Alex   | 26   |  74000 |

Let’s Start Sqooping

As you can see, the customer table does not have any primary key. I have added a few records in the customer table. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example --split-by id

Since I want to import this table directly into Hive I am adding –hive-import to my Sqoop command.

sqoop import --connect jdbc:mysql://localhost:3306/sqoop 
--username root 
--split-by id 
--columns id,name 
--table customer  
--target-dir /user/cloudera/ingest/raw/customers 
--fields-terminated-by "," 
--hive-table sqoop_workspace.customers

here’s what the individual Sqoop command options mean:

  • connect – provide JDBC string
  • username – database username
  • -P  – ask the password in the console. Alternatively, you can use –password, but this is not a good practice, as it's visible in your job execution logs and asking for trouble. One way to deal with this is to store the DB password in a file in HDFS and provide at runtime.
  • table – tell which table you want to import from MySQL – here it's customer.
  • split-by – specify your splitting column. I am specifying id here.
  • target-dir – HDFS destination dir.
  • fields-terminated-by – I have specified comma, as by default it will import data into HDFS with comma separated values.
  • hive-import – Import table into the hive (uses Hive’s default delimiters if none are set).
  • create-hive-table – If set, job will fail if Hive table already exists. It works in this case.
  • hive-table – specify <db_name>.<table_name>. Here it's sqoop_workspace.customers where sqoop_workspace is my database and customers is the table name.

As you can see below, Sqoop is a map-reduce job. Notice that I am using -P for password option.  This works, but it can be easily parameterized by using –password  and reading it from a file.

sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer  --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/03/01 12:59:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0
Enter password: 
16/03/01 12:59:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/03/01 12:59:54 INFO tool.CodeGenTool: Beginning code generation
16/03/01 12:59:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/03/01 13:00:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.jar
16/03/01 13:00:01 WARN manager.MySQLManager: It looks like you are importing from mysql.
16/03/01 13:00:01 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
16/03/01 13:00:01 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
16/03/01 13:00:01 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
16/03/01 13:00:01 INFO mapreduce.ImportJobBase: Beginning import of customer
16/03/01 13:00:01 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
16/03/01 13:00:02 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/03/01 13:00:04 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/03/01 13:00:05 INFO client.RMProxy: Connecting to ResourceManager at /
16/03/01 13:00:11 INFO db.DBInputFormat: Using read commited transaction isolation
16/03/01 13:00:11 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `customer`
16/03/01 13:00:11 WARN db.TextSplitter: Generating splits for a textual index column.
16/03/01 13:00:11 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/03/01 13:00:11 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/03/01 13:00:11 INFO mapreduce.JobSubmitter: number of splits:4
16/03/01 13:00:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456782715090_0004
16/03/01 13:00:13 INFO impl.YarnClientImpl: Submitted application application_1456782715090_0004
16/03/01 13:00:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1456782715090_0004/
16/03/01 13:00:13 INFO mapreduce.Job: Running job: job_1456782715090_0004
16/03/01 13:00:47 INFO mapreduce.Job: Job job_1456782715090_0004 running in uber mode : false
16/03/01 13:00:48 INFO mapreduce.Job:  map 0% reduce 0%
16/03/01 13:01:43 INFO mapreduce.Job:  map 25% reduce 0%
16/03/01 13:01:46 INFO mapreduce.Job:  map 50% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job:  map 100% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job: Job job_1456782715090_0004 completed successfully
16/03/01 13:01:48 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=548096
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=409
        HDFS: Number of bytes written=77
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters 
        Launched map tasks=4
        Other local map tasks=5
        Total time spent by all maps in occupied slots (ms)=216810
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=216810
        Total vcore-seconds taken by all map tasks=216810
        Total megabyte-seconds taken by all map tasks=222013440
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=409
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=2400
        CPU time spent (ms)=5200
        Physical memory (bytes) snapshot=418557952
        Virtual memory (bytes) snapshot=6027804672
        Total committed heap usage (bytes)=243007488
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=77
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Transferred 77 bytes in 104.1093 seconds (0.7396 bytes/sec)
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Retrieved 10 records.
16/03/01 13:01:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 13:01:49 INFO hive.HiveImport: Loading uploaded data into Hive

Logging initialized using configuration in jar:file:/usr/jars/hive-common-1.1.0-cdh5.5.0.jar!/hive-log4j.properties
Time taken: 2.163 seconds
Loading data to table sqoop_workspace.customers
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00001': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00002': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00003': User does not belong to supergroup
Table sqoop_workspace.customers stats: [numFiles=4, totalSize=77]
Time taken: 1.399 seconds

Finally, let’s verify the output in Hive.

Verify Output in Hive

hive> show databases;
Time taken: 0.034 seconds, Fetched: 2 row(s)
hive> use sqoop_workspace;
Time taken: 0.063 seconds
hive> show tables;
Time taken: 0.036 seconds, Fetched: 1 row(s)
hive> show create table customers;
CREATE TABLE `customers`(
  `id` string, 
  `name` string)
COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
Time taken: 0.26 seconds, Fetched: 18 row(s)

hive> select * from customers;
1    John
2    Kevin
19    Alex
3    Mark
4    Jenna
5    Robert
6    Zoya
7    Sam
8    George
9    Peter
Time taken: 1.123 seconds, Fetched: 10 row(s).

That’s it for now. Hope you find it useful. Thanks for your support and reading my blog.

hive, sqoop, virtualisation

Published at DZone with permission of Hardik Pandya , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}