Faster Bulk Loading in Postgres With Copy
Faster Bulk Loading in Postgres With Copy
Postgres \copy is a mechanism for you to bulk load data in or out of Postgres. But how exactly does it work? Take a look in this detailed tutorial.
Join the DZone community and get the full member experience.Join For Free
Built by the engineers behind Netezza and the technology behind Amazon Redshift, AnzoGraph™ is a native, Massively Parallel Processing (MPP) distributed Graph OLAP (GOLAP) database that executes queries more than 100x faster than other vendors.
If you've used a relational database, you understand basic
INSERT statements. Even if you come from a NoSQL background, you likely grok inserts. Within the Postgres world, there is a utility that is useful for fast bulk ingestion:
\copy is a mechanism for you to bulk load data in or out of Postgres.
First, let's pause. Do you even need to bulk load data and what's it have to do with Citus? We see customers leverage Citus for a number of different uses. When looking at Citus for a transactional workload, say as the system of record for some kind of multi-tenant SaaS app, your app is mostly performing standard insert/updates/deletes.
But when you're leveraging Citus for real-time analytics, you may already have a separate ingest pipeline. In this case, you might be looking at event data, which can be higher in volume than a standard web app workload. If you already have an ingest pipeline that reads off Apache Kafka or Kinesis, you could be a great candidate for bulk ingest.
Back to our feature presentation: Postgres
\copy. Copy is interesting because you can achieve much higher throughput than with single row inserts.
- Can be executed from within
psqlor an application language.
- Supports CSV as well as binary data.
- Is transactional.
A Look at Postgres \copy Performance for Bulk Ingest
Let's take a look at both how Postgres copy performs over single row inserts, then we'll move onto how you can leverage
\copy. There are essentially three primary ways to scale out your writes in single-node Postgres. And if you need to scale out ingest even further beyond a single node you can look to Citus as each node in a Citus database cluster is able to help you at scaling writes.
Starting Simple With Inserts Using Sample GitHub Event Data
First let's see how we perform on a standard insert.
To perform this test, I created a set of 1,000,000 inserts from the GitHub event dataset. I then connected to Postgres with
psql and ran
\i single_row_inserts.sql. This command executed all the insert queries.
The result: It took 15 minutes 30 seconds to load up 1 million events records. Or right at 1,075 inserts per second on a small-size Postgres instance. Yes, I could have had a few more writers going at once and further tuned my test, but this gives us a starting baseline to compare against.
Performance Tuning Our Inserts
It is of note here that each insert is a transaction. What this means is Postgres is doing some extra coordination to make sure the transaction is completed before returning. On every single write this takes some overhead. Instead of single row transactions, if we wrap all of our inserts in a transaction like below, we'll see some nice performance gains:
begin; insert 1; insert 2; insert 3; ... commit;
This took my inserts down from 15 minutes 30 seconds to 5 minutes and 4 seconds. We've suddenly boosted our throughput by 3x to about 3k inserts per second.
Bulk Ingest With Even Greater Performance ft. Postgres \copy
By batching our inserts into a single transaction, we saw our throughput go higher. But hold on, there is even more we can do. The
\copy mechanism gives a way to bulk load data in an even more performant manner. Here instead of raw insert statements, we need the data in some CSV format. Copy accepts a number of delimiters and even forms of binary data as well, but a standard CSV should work. Once we have our CSV of events we can load it up from within Postgres:
\copy github_events from 1millionevents.csv WITH csv;
\copy completes in 82 seconds! We're now processing over 10k writes per second on some fairly modest hardware.
Okay, so those of you who are frequent repeaters of our Citus Data blog know that I love
psql and the command line utilities in Postgres, but the CLI may not be helpful directly in your application. Fortunately, the copy utilities are available in most languages. Here's the quick rundown for a few popular languages:
Leveraging Bulk Loading in Ruby
Within Ruby, you'll need to be using the standard
pq driver or
ActiveRecord. The following gives you a rough sketch for establishing a connection from within ActiveRecord and loading from CSV on disk:
conn = ActiveRecord::Base.connection rc = conn.raw_connection rc.exec("COPY my_table (col1, col2, etc) FROM STDIN WITH CSV") file = File.open('myfile.csv', 'r') while !file.eof? rc.put_copy_data(file.readline) end rc.put_copy_end
Leveraging Bulk Loading in Python
Of course anything Ruby can do, Python can do (better!). In Python, leverage a similar process as we did in ruby. With this code example, we're going to generate a CSV live though as opposed to reading one from disk. In production, in both Ruby and Python, you could parse the CSV out from disk, or generate it live from some other part of your pipeline like Kafka/Kinesis.
conn = psycopg2.connect(DSN) curs = conn.cursor() data = StringIO.StringIO() data.write('\n'.join(['1', '2','3'])) data.seek(0) curs.copy_from(data, 'my_table')
Bulk ingestion with
\copy is great for a lot of workloads and doesn't require you to load up millions of record CSVs either. You can start with micro-batches of 100, 1000, or 10000 records and still see significant performance gains.
Published at DZone with permission of Craig Kerstiens , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.