{{announcement.body}}
{{announcement.title}}

How Storj Is Migrating From Postgres to CockroachDB: Bulk Loading Performance

DZone 's Guide to

How Storj Is Migrating From Postgres to CockroachDB: Bulk Loading Performance

In this article, take a look at the process of migrating from Postgres to CockroachDB.

· Database Zone ·
Free Resource

I started using CockroachDB about four months and its been really great, I love the product. Here is a story about how I recently migrated a large database (~400GB) from PostgreSQL over to CockroachDB. This blog is a recap of my process and also of some of the tricky parts I encountered.

Setup Details

CockroachDB version: 19.2.6, Cockroach Cloud running in Google Cloud (GCP), 3 node cluster with 1.5TB capacity

Postgres version: 9.6, GCP CloudSQL

Overview, tl;dr

CockroachDB has documentation on the topic of migrating from Postgres:

The docs are helpful to show the options for executing the migration, however they don’t have much information on how to optimize for large migrations to efficiently bulk load large amounts of data into CockroachDB. For example, I wanted to stay under 4 hrs of downtime so I needed a migration strategy that would be fast since it entailed migrating a 375GB database.

From the docs, there are essentially two different ways to perform the migration:

  1. pg_dump the data from Postgres (either one table at a time or a full database dump)
  2. \copy or COPY from Postgres to a CSV one table at a time

High level lessons learned (details after if interested):

  • don’t keep the schema definition in the same file as the data to be imported
  • if one table is very large, split the large data dump across many files so that the import can be parallelized across all the nodes in the cockroach cluster
  • importing from a CSV is faster than importing from a pg_dump file in CockroachDB since the CSV importer has been optimized more
  • importing data into CockroachDB that is ordered by the primary key is much faster than importing data in other orders
  • if importing into CockroachDB from a GCP bucket, make sure to use v20.1.2 or greater since any version before that does not have retry logic if the connection between CockroachDB and the bucket fails (which seems to happen intermittently when cockroach is under high CPU load)

Details

In order to perform the migration from Postgres to CockroachDB, the high level steps that needed to occur in under 4 hours of downtime were:

  1. export the schema and data from Postgres
  2. upload the data to a GCP storage bucket
  3. import the schema and exported data into CockroachDB from the GCP bucket

All of my infrastructure was running in Google Cloud, so I made sure to run all of the steps in the same region to reduce network latency. Google Cloud states in their docs “locations within regions tend to have round-trip network latencies of <1ms in the 95th percentile”, which is blazing fast so I made sure to capitalize on that.

My database migration only included one large table. I started out by trying to migrate the database usingpg_dump :

I found that CockroachDB is not fully compatible with a full Postgres dump that contains the schema and data. In the docs, it describes the edits that need to be made to be compatible. Since my dump file was 375gb it was very difficult to edit. Instead I manually added the schema to a file with the necessary edits, then appended a data-only dump and to the same file after. For example, my schema was a little something like this:

Java
 




x


 
1
// dump.sql
2
CREATE TABLE example (    
3
    id INT PRIMARY KEY, 
4
    name TEXT
5
);


My First Attempt

My first attempt to import was with these steps:

Java
 




xxxxxxxxxx
1


 
1
// data only dump, took 32 mins to dump 375gb
2
$ pg_dump -h $host -U $user -d $db -a -W --format=p >> dump.sql// upload the dump to a gcp storage bucket
3
// took 45 mins for a 375gb file when uploading from a VM that
4
// was in the same gcp region as the bucket
5
$ gsutil cp dump.sql gs://postgres-dump-west2-2020-05-05// from the cockroach shell import from the gcp bucket
6
$ cockroach sql --url 'postgres://$user@$host/$db'$ IMPORT PGDUMP 'gs://postgres-dump-west2-2020-05-05/dump.sql?AUTH=specified&CREDENTIALS=<redacted>' WITH skip_foreign_keys;// check the status of the import
7
$ cockroach sql --watch 60s \
8
    --url 'postgres://$user@$host/$db' \
9
    --execute "SELECT description, fraction_completed FROM [SHOW JOBS] WHERE job_type = 'IMPORT';";


The first problem I encountered was that the import was not showing up in the SHOW JOBS command. At first I theorized that the import job wasn’t showing up yet because each node in the CockroachDB cluster needed to download the dump.sql file from the bucket, which I knew should take about 45 mins. I let the import run for over an hour, but still the import job did not show up in SHOW JOBS. I ended up going on the CockroachDB slack channel to ask for help. I found out some great information...

It turns out that if you import a dump file that contains the schema and also the data, CockroachDB first fetches/parses the input to find the schema definition during planning, then after that the job will be created. Then once the job is created, it will download again to import the data. For a small dumpfile, this extra fetch isn’t too bad, but since my dumpfile was 375 gb, it was super slow.

Since I was only importing a single table, to solve double fetch problem, I could instead specify the schema in the import command. Then the dumpfile would only used for the row data which would create a job right away and only need to fetch the file once:

Java
 




xxxxxxxxxx
1


 
1
$ cockroach sql --url 'postgres://$user@$host/$db'$ IMPORT TABLE example ( id INT PRIMARY KEY, name TEXT) PGDUMP DATA ('gs://postgres-dump-west2-2020-05-05/dump.sql?AUTH=specified&CREDENTIALS=<redacted>')


However another limitation I learned of from the CockroachDB slack channel is that the dumpfile is the unit of work performed by a node in the CockroachDB cluster. This means that if there are 3 nodes in the cluster, but I’m only importing one dumpfile, then only 1 node will do all the work, where the other two nodes will be sitting doing nothing. Meaning there is no parallelism within a single file being imported, although there is an issue to address it. In my case, I could have attempted to export the data into separate files, but I didn’t want to deal with the extra complication unless I needed to.

In the cases where there are many tables being imported, it would be best to dump each table in a separate file and perform separate import commands for each one to make use of the parallel work.

I also learned the following tips from the CockroachDB slack:

  • the pgdumpparser is one of the slower/higher overhead parsers since it has to fire up a full lexer/parser. Where pgcopy and CSV importers are a bit better (from my personal experience the CSV importer was about 40% faster than the pgdump importer)
  • if you pgdump data ordered by primary key, that would make the ingesting part of the IMPORT much faster on the CockroachDB side
  • the CSV importer is the most tuned at the moment so it may be faster although there aren’t any formal benchmarks to confirm

My Second Attempt

From all this extra info, I decided to modify my migration strategy to the following:

  1. manually create the table and schema in CockroachDB
  2. export the data from Postgres ordered by primary key and write to a CSV
  3. upload the CSV dump to a GCP bucket
  4. use the CockroachDB CSV importer to ingest data
Java
 




xxxxxxxxxx
1


 
1
// connected to postgres shell
2
$ psql -h $host -U $user -d $db// copy the entire table ordered by primary key and write to a csv
3
=> \copy (SELECT * FROM example ORDER BY id) TO 'dump.csv' WITH CSV


However exporting with ORDER BY was incredibly slow, writing about 1GB/min, compared to 11GB/min with pgdump. If I removed the ORDER BY clause it was improved up to about 10GB/min.

Importing into CockroachDB with data ordered by primary key was much faster, about 2GB/min, compared with 0.5GB/min with default ordering.

Since exporting data ordered by primary key was way too slow, I instead exported in the default ordering. Then I used the bashsort command to order it by primary key which turned out to sort about 6GB/min, much better than having Postgres order it at 1GB/min. As a side note, sort was limited by disk IO, so to speed it up I could have increased disk throughput.

 
export/import rates

My Third Attempt

Overall these are the steps I found to be sufficient to do my migration in under 4 hrs (coming in at 3.5 hrs total):

Java
 




xxxxxxxxxx
1
13


 
1
// connect to postgres shell
2
$ psql -h $host -U $user -d $db// copy the entire table and write to a csv
3
=> \copy (SELECT * FROM example) TO 'dump.csv' WITH CSV// sort the exported data by primary key
4
$ sort --parallel=8 --field-separator=',' dump.csv > sorteddump.csv// upload sorted csv to a gcp bucket
5
$ gsutil cp sorteddump.csv gs://bucket// log into cockroachDB shell
6
$ cockroach sql --url 'postgres://$user@$host/$db'// create table schema manually
7
CREATE TABLE example (
8
    id INT PRIMARY KEY,
9
    name TEXT
10
);// import into cockroachDB
11
IMPORT INTO example CSV DATA (
12
    'gs://bucket/sorteddump.csv?AUTH=specified&CREDENTIALS=redacted'
13
);


Alternative Options for Migrating

If these steps were not fast enough to do the entire migration under 4 hrs, I was considering two other options:

  1. write a script to do the export in parallel and export the data from Postgres into many files like so, then I could have also imported in parallel:
Java
 




xxxxxxxxxx
1


 
1
$ \copy (
2
    SELECT * FROM example ORDER BY id limit x offset 0
3
  ) TO 'dump1.csv' WITH CSV$ \copy (
4
    SELECT * FROM example ORDER BY id limit y offset x
5
  ) TO 'dump2.csv' WITH CSV...


2. implement some tooling for change data capture where I could snapshot Postgres, import into CockroachDB, then get CockroachDB up to date by reading the missing records from Postgres’ WAL

However both of these options would have required more work so I’m happy the base import support in CockroachDB was sufficient for my needs.

Miscellaneous Tidbits

That last two random bits of info I learned that were helpful to me:

  1. Two hours in on one of my import jobs it failed with this error:
Java
 




xxxxxxxxxx
1


 
1
$ IMPORT INTO example CSV DATA (
2
    'gs://bucket/sorteddump.csv?AUTH=specified&CREDENTIALS=redacted'
3
);error: reading CSV record: tcp read: connection reset by peer


I didn’t know which connection it was referring to and I thought it might be between my client and the cockroach shell. Once again I turned to the CockroachDB slack channel for help. Turns out that this error was likely between a cockroach node and the GCP bucket where if the connection closes cockroach doesn’t retry. Additionally it seems these errors are more likely when the node is at 100% CPU under heavy load .However there is this issue that just got merged, to handle these types of errors, but wasn’t released as of v20.1.1.

2. Once you start the import into CockroachDB and the job shows up in SHOW JOBS or in the admin UI, your client no longer need to maintain a connection to the database, it will keep running the job on its own.

Conclusion

I’ve really enjoyed my time exploring CockroachDB. So far I like the product and I love the Cockroach Cloud offering.

Cockroach seems to be working hard at actively making improvements so it seems in the near future the bulk import process should be very smooth.

Here are some of the issues in the pipeline that address the tricky areas I encountered:

  • Issue to fix the missing job during schema extraction
  • Issue that lists some improvements to performance of importing data out-of-order
  • Issue for parallelization of single file import
  • PR that fixes retries to google cloud storage access
Topics:
cockroachdb, database migration challenges, nosql, postgres database, relational database management system, sql (structured query language)

Published at DZone with permission of Jessica G . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}