Have you ever needed to write a query that combined data from two PostgreSQL databases? Our client Healthify has.
Since Healthify works with healthcare and stores some patient information, they’re very careful not to send any personally identifiable information to third party services.
One such service is Segment, which Healthify uses to track user activity on the site. Conveniently, Segment can send your data to a PostgreSQL data warehouse, so Healthify has two PostgeSQL databases full of useful information:
- Their app database, which their Rails app uses, and
- a reporting database, with anonymised data from Segment.
To better understand how people are using the application, we wanted to answer some questions that needed data from both sources. We wanted to filter the reporting data using attributes from the app data, without having to pass those attributes to Segment. This blog post explains what we did.
Foreign Data Wrapper
PostgreSQL has a useful feature called Foreign Data Wrapper, which lets you create foreign tables in a PostgreSQL database that are proxies for some other data source. When you make a query against a foreign table, the Foreign Data Wrapper will query the external data source and return the results as if they were coming from a table in your database.
There are two foreign data wrappers that ship with PostgreSQL:
file_fdwto create foreign tables that represent flat files (Postgres 9.1 and later), and
postgres_fdwto create foreign tables that represent tables in another PostgreSQL database (Postgres 9.3 and later).
You can also define your own wrapper, or use a third-party wrapper (we’ve used
tds_fdw on other thoughtbot projects), but for Healthify’s situation
postgres_fdw was exactly what we needed.
Before we could create foreign tables in our reporting database to proxy the tables in our app database, we needed to do a bit of setup in the reporting database.
- Install the
CREATE EXTENSION postgres_fdw;
- Create a server:
CREATE SERVER app_database_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'postgres.example.com', dbname 'my_app');
- Create a user mapping, which defines the credentials that a user on the local server will use to make queries against the remote server:
CREATE USER MAPPING FOR CURRENT_USER SERVER app_database_server OPTIONS (user 'reporting', password 'secret123');
Since we’re using this for reporting, we chose to create a separate user in our app database specifically for Foreign Data Wrapper to use. That user only has read permissions, so a badly written reporting query can’t accidentally
Now we’ve told Foreign Data Wrapper where the server is, and the credentials it can use to connect, we’re ready to define the foreign tables.
We could use
CREATE FOREIGN TABLE for this, but we’d end up specifying all of the fields for each table we want to access. We’d effectively end up repeating our whole schema. Fortunately, Postgres 9.5 introduced the
IMPORT FOREIGN SCHEMA command:
CREATE SCHEMA app; IMPORT FOREIGN SCHEMA public FROM SERVER app_database_server INTO app;
This will create foreign tables for all of the tables from our app database’s
public schema into our reporting database’s
A schema in PostgreSQL is just a namespace for grouping tables. If you haven’t explicitly specified a schema then you’re implicitly using the
We could import from the app database’s
public schema into the reporting database’s
public schema, but keeping all of the foreign tables in a separate schema means we don’t have to worry about naming collisions.
Segment also recommends that if you’re going to put any additional tables in a PostgreSQL database that you’re using as a Segment warehouse you should make sure it’s in a separate schema:
If you want to insert custom data into your warehouse, create new schemas that are not associated with an existing source, since these may be deleted upon a reload of the Segment data in the cluster.
Now that we’ve imported our schema, we can query the foreign tables as if they were normal, local tables.
SELECT COUNT(*) FROM app.users;
We can even
JOIN between a local table and a foreign table.
SELECT COUNT(*) FROM production.events JOIN app.users ON production.events.user_id::INT = app.users.id WHERE app.users.company_id = 10;
It Ain’t All Good
Handling Schema Changes
When we ran
IMPORT FOREIGN SCHEMA we created foreign tables based on the current structure of the app database. If the app database changes —if we add a table, or drop a column — our foreign tables will need updating too.
To work around this we added some code to update our reporting database’s foreign tables every time we run the migrations on the production app.
Rake::Task["db:migrate"].enhance do if ENV["UPDATE_FOREIGN_TABLES_AFTER_MIGRATE"] Rake::Task["db:export_schema_for_reporting"].invoke end end
UPDATE_FOREIGN_TABLES_AFTER_MIGRATE environment variable lets us control which environments use this task. It’s always going to be useful in production, but rarely in development.
db:export_schema_for_reporting task runs all of the queries we’ve previously seen, slightly modified to make them idempotent:
BEGIN; CREATE EXTENSION IF NOT EXISTS postgres_fdw; DROP SERVER IF EXISTS app CASCADE; CREATE SERVER … ; CREATE USER MAPPING … ; DROP SCHEMA IF EXISTS app; CREATE SCHEMA app; IMPORT FOREIGN SCHEMA … ; COMMIT;
Rebuilding the server config and foreign tables every time we migrate keeps our reporting server up-to-date. Wrapping the whole thing in a transaction ensures queries against the foreign tables that are run while we’re running the
db:export_schema_for_reporting task will be blocked until the task finishes, but won’t fail.
There were some fringe benefits from running this from our Rails application: we can work out what arguments to pass to
CREATE SERVER by reading our application’s database configuration.
Once we had everything set up we learnt that we couldn’t always treat foreign tables exactly like local tables. In particular, the performance of queries that
JOIN over multiple foreign tables was very poor.
Behind the scenes, PostgreSQL was making separate queries to the app database and then
JOINing them after the fact in the reporting database.
JOINs as sub-queries improved things significantly.
Now that Postgres 9.6 has been released, this should be less of a problem. The Postgres 9.6 release notes say:
postgres_fdwnow supports remote joins, sorts,
See the original here: PostgreSQL's Foreign Data Wrapper.