When you're working on modern and fresh applications, one of the goals of their development is usually to replace their legacy equivalent. Last week I performed an import of this sort, from legacy to new application, and I wanted to share how this process was different from the usual dumps that I've seen in action in the past.
A database schema is not an API
When you're talking about importing and exporting data from a PHP application, at least in the field of web development, you refer to acting directly on the database to extract data from a running application, maybe in a staging environment:
mysqldump -h ... -d database_name ...
to the subsequent upload of a big file:
scp dump.sql.gz production.example.com:
and to its restoring inside of the real database:
mysql -h ... < dump.sql
However, the database-oriented approach has several problems that make it viable only for web sites, more than for web applications.
First of all, it is based on the schema as an integration point, which is a poor choice as it's an implementation detail of the application more than a published interface with which to interact from a client. Keeping an import client in sync with a changing database schema is difficult; what is even worse is not being able to change your schema because code not under your control has become dependent on it.
Direct integration with a schema is also imprecise in the cases where the schema is implicit: in MongoDB 2, 2.0 and "2.00" are three very different values and importing one instead of the other will fail your queries.
Extending the previous issues, the database nowadays is not the full state of an application, even in shared nothing environments where servers are immutable and store no local files. S3 objects are an example, but also any other external web service with which you integrate needs to be populated together with the db. When your application is composed of several services, the same applies as your databases are on different machines.
An import API
Here's how the import API I have written to avoid interacting with databases directly look like.
Under the covers, it aligns three different components of a distributed systems:
curl -X PUT -d 'service=42¬ify_url=example.com&...' http://onebip.com/imported-subscription/12345678901234567890abcd
It is an Aggregate-based API, in the DDD meaning: it imports a single Aggregate like a Subscription or a User or a ForumPost, but it does not publish the internal structure of it. More precisely, the API imports an Aggregate for each Bounded Context involved, dividing the data between them and regenerating the missing fields so that the required parameters are minimal (10 fields instead of the total 40 that would be needed with a database migration script).
Importing through an Aggregate-oriented API is slower than with a database script, but more reliable: the input data can be normalized to the right format (especially in the case of NoSQL databases which do not have strong typing information in the schema and where a query matching the value 2 won't match "2". It is also more robust as Aggregates are independent from each other: what if the import of a single row fails in the middle of your big .sql script? How do you recover from it, abort the transaction and rerun the whole import all over again?
The importing scripts (that call the API) must log what they do and the result of their operations, so that failed imports can be dealt with later.
Given the nature of this import, you can easily take advantage of parallelization: last week, I wrote the client script to accept an offset and limit parameter; its work could be divided like this:
./import.php 0 100 ./import.php 100 100 ./import.php 200 100 ...
and I've run 30 workers in parallel with this command:
cat chunks.txt | xargs -n 2 -P 30 ./import.php
where chunks.txt is a generated file of the form:
0 100 100 100 200 100 ...
The number of the workers is limited only by the throughput of the slowest of the distributed systems involved in the import. The number of served requests for one of the three ones where I was importing leveled up with 30+ workers, and so it was no use increasing the load in the hope for better performance. If you are on a cloud environment, it's aesy to think about starting more servers when importing to increase the acceptable load, and shutting them down afterwards.
Don't bulk import into a database directly: consistency cannot be guaranteed just by a schema. Use the API of your application, or take this as an excuse to create one. It will be reusable for other cases, preventing damages to data because of imprecisions, especially for NoSQL-backed applications. Performance will be a concern, but parallelizing the import into different Aggregates it's the way to support infinite scalability.