NoSQL does not mean no migrations (but opens up new ways of doing them)
MongoDB does not require a schema: you can just throw heterogeneous documents at a collection that does not exist yet, and it will be created and will store all your data.
After a year of using MongoDB in production, how does this design choice affect the development of new code and the maintenance of existing code?
No homogeneity
By homogeneous documents we mean, for example, two documents with the same set of fields (and different values, of course); heterogeneous documents have additional fields that do not exist in the others. NoSQL promotes heterogeneity, and you must know that, by design, there is no way to enforce a common "schema" on the millions of documents in each collection.
This does not mean you should be working with few collections: since new collections are cheap and do not require a schema, often you can create multiple ones where you would create a single SQL table, as long as you don't need a query to hit more than one of them.
No migrations?
Besides removing the overhead of defining a set of columns beforehand, the lack of schema enforcement is a feature of MongoDB: it lets you migrate documents from one version to another in a production database without interrupting availability. You deploy code that works with both versions of the schema, and then start the migration that will eventually transform all documents.
There are multiple ways of performing this migration, and they may differ from a migration task executed during the build. For example, the code itself may pick up a document in the old version the first time it is needed, but save it only in the new one. This kind of migration proceeds slowly and puts little load on the database; you can even interrupt it with a rollback in case of disaster.
However, the downside is that in production you can actually find documents that are the result of earlier tests or that have a different structure. So this code may break:
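The lazy, read-triggered migration described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `schema_version`, `name`, `first_name` and `last_name` fields and the `upgradeToV2` and `loadAndUpgrade` helpers are all hypothetical, and the `collection` argument is assumed to expose the shell's `findOne()` and `save()` methods.

```javascript
// Hypothetical example: version 1 documents store a single "name" string,
// version 2 documents store "first_name" and "last_name" separately.
function upgradeToV2(doc) {
    if (doc.schema_version === 2) {
        return doc; // already in the new format, nothing to do
    }
    var parts = (doc.name || '').split(' ');
    return {
        _id: doc._id,
        schema_version: 2,
        first_name: parts[0],
        last_name: parts.slice(1).join(' ')
    };
}

// Reads a document in either version, but only ever writes the new one.
// `collection` is assumed to offer findOne() and save(), as a mongo shell
// collection does.
function loadAndUpgrade(collection, id) {
    var doc = collection.findOne({ _id: id });
    if (doc === null) {
        return null;
    }
    var upgraded = upgradeToV2(doc);
    if (upgraded !== doc) {
        collection.save(upgraded); // migrate lazily, one document at a time
    }
    return upgraded;
}
```

Because the conversion lives in one pure function, the same `upgradeToV2` could also be reused by a batch script that walks the remaining old documents once traffic has migrated most of them.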
db.collection_name.find().forEach(function(d) {
    // do something with d.my_field
});
even if 99.9% of the documents contain the field my_field. The JS console will return an undefined value, but programming language bindings and your own code may raise exceptions when dealing with missing results; NullPointerExceptions are just a method call away. Test coverage must be adjusted accordingly.
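One hedged way to make such a loop tolerate stragglers is to check for the field explicitly and set aside the documents that lack it. The sketch below works on a plain array so that it can run anywhere; `partitionByField` is a hypothetical helper, and in the shell the array could come from `db.collection_name.find().toArray()`.

```javascript
// Splits documents into those that have a usable value for `field`
// and those that do not, instead of failing on the first odd one.
function partitionByField(docs, field) {
    var result = { present: [], missing: [] };
    docs.forEach(function (d) {
        if (d[field] === undefined || d[field] === null) {
            result.missing.push(d);
        } else {
            result.present.push(d);
        }
    });
    return result;
}

var split = partitionByField([
    { _id: 1, my_field: 'a' },
    { _id: 2 },                  // leftover from an early test
    { _id: 3, my_field: null }
], 'my_field');
// split.present holds document 1; split.missing holds documents 2 and 3
```

The same array of odd documents makes a convenient fixture for the tests that the extra coverage calls for.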
It could be interesting to look for tools that visualize the implicit schemas of the documents in a collection. On large collections their linear scans can be expensive, although you would run such a scan only occasionally rather than on every request. Moreover, differences in schemas are not only related to the set of fields.
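Short of a dedicated tool, a rough picture of the implicit schemas can be obtained by counting how many documents share each set of field names. The following is a minimal sketch, not a real tool: `summarizeFieldSets` is a hypothetical helper that only looks at top-level field names and ignores value types and nested documents.

```javascript
// Counts how many documents share each set of top-level field names.
// The key of each entry is the sorted, comma-separated list of names.
function summarizeFieldSets(docs) {
    var counts = {};
    docs.forEach(function (d) {
        var signature = Object.keys(d).sort().join(',');
        counts[signature] = (counts[signature] || 0) + 1;
    });
    return counts;
}
```

In the shell it could be fed a sample instead of the whole collection, for example `summarizeFieldSets(db.collection_name.find().limit(1000).toArray())`, which keeps the cost of the scan bounded at the price of possibly missing rare shapes.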
No bounded field values
db.collection_name.distinct('field_name')
How many different values does this query return? Chances are that it will pick up old values which your codebase no longer writes (but can still read, for consistency).
Again, to be fair, this is a feature: it means that the purchases made in 2012 still have "logged_in_by": "SystemA" while those made in 2013 have "logged_in_by": "SystemB", even though SystemA does not exist anymore and SystemB didn't exist back then, so there is no valid value to migrate the 2012 documents to.
This means you should take a different approach to exposing enum values: while in a SQL database these values are embedded in the schema, in a NoSQL database like Mongo they may be produced by a distinct() query, and maybe denormalized into a new collection.
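As a rough sketch of that denormalization, the distinct values can be copied into a small lookup document that the application reads instead of scanning the large collection. Everything named here is an assumption made for illustration: `refreshEnumValues` is a hypothetical helper, the `{ _id, values }` layout is made up, the values are assumed to be strings, and `source` and `target` are assumed to behave like mongo shell collections offering `distinct()` and `save()`.

```javascript
// Copies the distinct values of `field` into one lookup document, keyed by
// the field name, so they can be listed without touching the big collection.
function refreshEnumValues(source, target, field) {
    var values = source.distinct(field).sort();
    target.save({ _id: field, values: values });
    return values;
}
```

With hypothetical collection names, a call could look like `refreshEnumValues(db.purchases, db.enum_values, 'logged_in_by')`, run periodically or whenever the code starts writing a new value.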
No indexes
Indexes are usually part of the schema in SQL databases, while MongoDB has only an implicit schema. Fundamentally, there is still an operation that has to be called once, the index creation; this operation is usually put not inside the code that reads or writes the data but inside migration scripts.
MongoDB tries to provide a schemaless approach that also covers indexes: its ensureIndex() collection method can be called multiple times, triggering an index creation only the first time. It's a kind of lazily created index.
However, by default these indexing operations are not executed in the background: they are synchronous and only return after the index has been built. Since indexes are usually added after the data has been online for quite some time, depending on which queries need optimization, this operation may be costly. You may want to perform it offline, outside of the user-facing processes, rather than slowing down requests to your application.
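A migration script that builds indexes outside of the request path might look roughly like the sketch below. It is an assumption-laden illustration: `buildIndexes` is a hypothetical helper, `collection` is assumed to expose the shell's `ensureIndex()` method, and the `background: true` option is passed to ask for a non-blocking build.

```javascript
// Meant to be run once from a migration script, not from request-handling
// code. Creates one ascending single-field index per name in `fieldNames`.
function buildIndexes(collection, fieldNames) {
    fieldNames.forEach(function (name) {
        var keys = {};
        keys[name] = 1; // 1 means ascending order
        collection.ensureIndex(keys, { background: true });
    });
}
```

In the shell this could be invoked as `buildIndexes(db.collection_name, ['field_name'])`; since ensureIndex() does nothing when the index already exists, running the script a second time is harmless.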
Conclusion
Being schemaless doesn't mean you will never have to migrate data between the implicit schemas with which it is represented inside NoSQL databases such as MongoDB. Even when you leave in place code that works with multiple versions of the data and provides a smooth transition between the old and new formats, indexes are an example of an operation that is performed once and for all, and that mostly lives outside of the application code the database supports.