Import Data From Hadoop Into CockroachDB
In this article, we are going to take a slight detour from Docker Compose and evaluate ingesting data from Hadoop into CockroachDB.
- Information on CockroachDB can be found here.
- Information on Hadoop Ozone can be found here.
- Information on Minio can be found here.
- Upload a file to HDFS.
I have a CSV file I created with my favorite data generator tool, Mockaroo.
curl "https://api.mockaroo.com/api/38ef0ea0?count=1000&key=a2efab40" > "part5.csv" hdfs dfs -mkdir /data hdfs dfs -chmod -R 777 /data hdfs dfs -put part5.csv /data
Data looks like this:
id,first_name,last_name
fc2482b8-48e1-470d-976c-3b4134f44e65,Sandro,Bredee
719e2c11-e1fd-4129-b162-7557d20db728,Rock,Larkings
b6101f8c-a22b-4c51-bbab-058cd27aec6c,Nicol,Commins
63fc4f80-2d82-4dea-a86f-fa6d2a7f10cf,Maye,Columbine
5be11b08-80fb-4a24-bb32-edf565126cde,Marnie,Batte
75280e86-0558-4a67-8900-431fa8c3f3b2,Malena,Delves
cf8f5056-1cb0-4838-9afc-d1f58a56df0a,Shae,Betteson
e337f5a1-10c6-4d97-9590-38d2d6502285,Averil,Twigge
553e83d6-ce93-43c2-a64e-f8361cd65820,Amanda,Brundle
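If you want to spot-check the uploaded file directly in HDFS rather than locally, a quick sanity check with the standard hdfs CLI looks like this:

# list the uploaded file and preview the first few rows
hdfs dfs -ls /data
hdfs dfs -cat /data/part5.csv | head -5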
- Enable webhdfs on your Hadoop cluster.
The main task here is to edit the hdfs-site.xml file and make sure the following property is set to true.
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
  <final>true</final>
</property>
A more in-depth overview of webhdfs can be found on the Hadoop website.
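Keep in mind that a change to hdfs-site.xml only takes effect after the NameNode is restarted. How you restart it depends on your distribution (Ambari, systemd, or the stock scripts); as a rough sketch, assuming the stock Hadoop sbin scripts are on the PATH:

# restart the NameNode so the new dfs.webhdfs.enabled setting is picked up
# (the exact restart mechanism varies by distribution)
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode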
- Make sure the file is accessible via WebHDFS:
hdfs dfs -ls webhdfs://hadoop.example.com:50070/data/
Found 1 items
-rw-r--r--   1 vagrant hdfs      52061 2020-01-16 15:40 webhdfs://hadoop.example.com:50070/data/part5.csv
Unfortunately, the webhdfs scheme is not supported, and we need to rely on the HTTP protocol for our test. If you're curious, CockroachDB will output the following message if we attempt to import using the webhdfs scheme.
root@:26257/defaultdb> IMPORT TABLE importhdfs (
  id UUID NOT NULL PRIMARY KEY,
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  INDEX idx_last_name (last_name ASC)
) CSV DATA ('webhdfs://hadoop.example.com:50070/data/part5.csv') WITH skip = '1';
pq: unsupported storage scheme: "webhdfs"
Let's try accessing the file over HTTP using curl:
curl -i "http://hadoop.example.com:50070/webhdfs/v1/data/?op=LISTSTATUS"
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 16 Jan 2020 15:52:18 GMT
Date: Thu, 16 Jan 2020 15:52:18 GMT
Pragma: no-cache
Expires: Thu, 16 Jan 2020 15:52:18 GMT
Date: Thu, 16 Jan 2020 15:52:18 GMT
Pragma: no-cache
Content-Type: application/json
X-FRAME-OPTIONS: SAMEORIGIN
Transfer-Encoding: chunked

{"FileStatuses":{"FileStatus":[
{"accessTime":1579189250309,"blockSize":134217728,"childrenNum":0,"fileId":16405,"group":"hdfs","length":52061,"modificationTime":1579189251025,"owner":"vagrant","pathSuffix":"part5.csv","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}
]}}
Again, depending on your situation, you may need to authenticate and configure Kerberos.
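For example, on a Kerberized cluster you would typically obtain a ticket first and let curl negotiate SPNEGO authentication. A rough sketch, where the keytab path and principal are placeholders for your own environment:

# obtain a Kerberos ticket, then let curl negotiate SPNEGO against WebHDFS
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/hadoop.example.com@EXAMPLE.COM
curl -i -L --negotiate -u : "http://hadoop.example.com:50070/webhdfs/v1/data/part5.csv?op=OPEN"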
- Read the file using curl:
curl -i -L "http://hadoop.example.com:50070/webhdfs/v1/data/part5.csv?op=OPEN&user.name=vagrant"
- Import into CockroachDB:
root@:26257/defaultdb> IMPORT TABLE importhdfs (
  id UUID NOT NULL PRIMARY KEY,
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  INDEX idx_last_name (last_name ASC)
) CSV DATA ('http://hadoop.example.com:50070/webhdfs/v1/data/part5.csv') WITH skip = '1';
pq: error response from server: 400 Bad Request "{\"RemoteException\":{\"exception\":\"UnsupportedOperationException\",\"javaClassName\":\"java.lang.UnsupportedOperationException\",\"message\":\"op=NULL is not supported\"}}"
Let's try again, adding ?op=OPEN to the path.
root@:26257/defaultdb> IMPORT TABLE importhdfs (
  id UUID NOT NULL PRIMARY KEY,
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  INDEX idx_last_name (last_name ASC)
) CSV DATA ('http://hadoop.example.com:50070/webhdfs/v1/data/part5.csv?op=OPEN') WITH skip = '1';

        job_id       |  status   | fraction_completed | rows | index_entries | system_records | bytes
+--------------------+-----------+--------------------+------+---------------+----------------+-------+
  521406882554839041 | succeeded |                  1 | 1000 |          1000 |              0 | 81215
(1 row)

Time: 153.8283ms
And here we are, we just loaded data into CockroachDB from Hadoop:
root@:26257/defaultdb> select * from importhdfs limit 10;
id | first_name | last_name
+--------------------------------------+------------+-----------+
00aca9b0-3d78-4664-9f5c-9a58ba481aff | Kerk | Viney
00d37d61-f78b-4755-9aef-9e3112999f2e | Zita | Brisker
01e8422f-8c07-4989-b740-67b6811c6cef | Sylas | Hatchard
020106fe-f6ca-4715-ba7e-548a883e0a33 | Sapphire | Good
031988cb-e760-458d-b6c6-902ebd52f470 | Rica | Winley
0330c19f-c0da-4b38-aba4-8e0fd264ed04 | Korie | Blazic
033c08b5-fd04-4dc1-a612-2b991e6da619 | Walsh | Kernes
0382a85b-7140-4041-87a0-96da852c5c4d | Fredia | Boliver
038a0ff4-4055-4dec-8762-658dc096b997 | Marven | Baddow
039ca369-113a-4d0b-9b15-24d2c49489fa | Yulma | Adamowicz
(10 rows)
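As a quick sanity check, you can also confirm that all 1,000 rows made it in (the expected count below simply matches the rows column reported by the IMPORT job):

SELECT count(*) FROM importhdfs;
-- expected: 1000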
- Use Minio as HDFS Gateway.
I was not aware of this, but on a hunch I searched whether Minio could serve as an HDFS gateway, and it does! At the moment, this feature lives on GitHub rather than in their docs, so I question its supportability. Also, because we're testing with smaller volumes, using Minio may have additional benefits for load balancing the data; how scalable and feasible that is, I leave to you to test.
Setting it up is also a breeze.
- Set up Minio as an HDFS gateway.
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
export MINIO_ACCESS_KEY=minio
export MINIO_SECRET_KEY=minio123
./minio gateway hdfs hdfs://hadoop.example.com:8200
I used the hadoop user to start the gateway; the security implications of that are left to the reader's judgment. My Hadoop fs.defaultFS property is mapped to port 9000. Since the Minio gateway also uses port 9000 by default, I edited core-site.xml to move HDFS to port 8200.
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop.example.com:8200</value>
</property>
With that, I was able to start the gateway with ./minio gateway hdfs hdfs://hadoop.example.com:8200.
When the gateway is up, you should see:
*** Warning: Not Ready for Production ***

Endpoint:  http://10.0.2.15:9000  http://192.168.33.10:9000  http://127.0.0.1:9000
AccessKey: minio
SecretKey: minio123

Browser Access:
   http://10.0.2.15:9000  http://192.168.33.10:9000  http://127.0.0.1:9000

Command-line Access: https://docs.min.io/docs/minio-client-quickstart-guide
   $ mc config host add myhdfs http://10.0.2.15:9000 minio minio123

Object API (Amazon S3 compatible):
   Go:         https://docs.min.io/docs/golang-client-quickstart-guide
   Java:       https://docs.min.io/docs/java-client-quickstart-guide
   Python:     https://docs.min.io/docs/python-client-quickstart-guide
   JavaScript: https://docs.min.io/docs/javascript-client-quickstart-guide
   .NET:       https://docs.min.io/docs/dotnet-client-quickstart-guide
Once you browse to the web UI and input the credentials, you should see the file at the root of the /data dir.
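If you prefer the command line, the same check can be done with the mc client shown in the startup banner above (myhdfs is just the alias name used there):

# register the gateway endpoint and list the HDFS-backed bucket
mc config host add myhdfs http://192.168.33.10:9000 minio minio123
mc ls myhdfs/data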
At this point, we should be able to execute import pointing to the file with the HTTP scheme.
- Use Minio to import HDFS data into CockroachDB.
Still in the Minio UI, click the three dots to the right of the file in the browser, hit the Share button, then click Copy Link.
Then use the copied URL in your IMPORT statement.
IMPORT TABLE importhdfs (
  id UUID NOT NULL PRIMARY KEY,
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  INDEX idx_last_name (last_name ASC)
) CSV DATA ('http://192.168.33.10:9000/data/part5.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20200116%2F%2Fs3%2Faws4_request&X-Amz-Date=20200116T163901Z&X-Amz-Expires=432000&X-Amz-SignedHeaders=host&X-Amz-Signature=0a9ecd6731581c193746fd6113a5c094dfc061e2edf194b77554e80a770f68e4') WITH skip = '1';
Voila!
root@:26257/defaultdb> IMPORT TABLE importhdfs (
  id UUID NOT NULL PRIMARY KEY,
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  INDEX idx_last_name (last_name ASC)
) CSV DATA ('http://192.168.33.10:9000/data/part5.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio%2F20200116%2F%2Fs3%2Faws4_request&X-Amz-Date=20200116T163901Z&X-Amz-Expires=432000&X-Amz-SignedHeaders=host&X-Amz-Signature=0a9ecd6731581c193746fd6113a5c094dfc061e2edf194b77554e80a770f68e4') WITH skip = '1';

        job_id       |  status   | fraction_completed | rows | index_entries | system_records | bytes
+--------------------+-----------+--------------------+------+---------------+----------------+-------+
  521413400760745985 | succeeded |                  1 | 1000 |          1000 |              0 | 81215
(1 row)

Time: 169.0506ms
Here's the same example except we're using the S3 scheme for our HDFS data.
IMPORT TABLE importhdfs (
  id UUID NOT NULL PRIMARY KEY,
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  INDEX idx_last_name (last_name ASC)
) CSV DATA ('s3://data/part5.csv?AWS_ACCESS_KEY_ID=minio&AWS_SECRET_ACCESS_KEY=minio123&AWS_ENDPOINT=http://hadoop.example.com:9000') WITH skip = '1';
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT are all required properties. AWS_ENDPOINT points to the machine hosting my Hadoop instance as well as the Minio gateway.
At this point you've seen two methods to access HDFS data for import into CockroachDB. I'm wondering whether the same methods could be used for Change Data Capture and Backup/Restore with Hadoop; that's an idea for another day.
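Purely as a sketch of that idea, and not something tested here: CockroachDB's BACKUP statement (an enterprise feature at the time of writing) accepts the same s3:// URL style, so a backup aimed at the Minio gateway might look something like this, reusing the endpoint and credentials from above (the backups path within the bucket is hypothetical):

BACKUP DATABASE defaultdb
TO 's3://data/backups?AWS_ACCESS_KEY_ID=minio&AWS_SECRET_ACCESS_KEY=minio123&AWS_ENDPOINT=http://hadoop.example.com:9000';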
Please share your feedback in the comments.