Addressing the NoSQL Criticism
There were quite a few NoSQL critics at OSCON this year. I imagine this was true of past years as well, but I don’t know that firsthand. I think there are several reasons behind the general disdain for NoSQL databases.
First, NoSQL is a horrible name. It implies that there’s something wrong with SQL and that it needs to be replaced with a newer and better technology. If you have structured data that needs to be queried, you should probably use a database that enforces a schema and implements Structured Query Language. I’ve heard people start redefining NoSQL as “not only SQL”. This is a much better definition and doesn’t antagonize those who use existing SQL databases. An SQL database isn’t always the right tool for the job and NoSQL databases give us some other options.
Second, there are way too many different types of databases that are categorized as NoSQL. There are document-oriented databases, key/value stores, graph databases, column-oriented databases, in-memory databases, and other database types. There are also databases that combine two or more of these properties. It’s easy to criticize something that is vague and loosely defined. As the NoSQL space matures, we’ll start to get some more specific definitions, which will be much more helpful.
Third, at least one very popular vendor in the NoSQL space has a history of making irresponsible claims about their database’s capabilities. Antony Falco of Basho (makers of Riak) has a great blog post on the topic: It’s Time to Drop the “F” Bomb – or “Lies, Damn Lies, and NoSQL.” If you care about your data, please read Tony’s blog post. It’s unfortunate that the specious claims of a few end up making everyone in the NoSQL space look bad.
I also want to address some of the specific criticisms that I’ve heard of NoSQL, as they apply (or don’t apply) to CouchDB (I’m not familiar enough with other NoSQL databases to talk about those).
SQL Databases Are More Mature
This is absolutely true. If you pick a NoSQL database, you should do your homework and make sure that your database of choice truly respects the fact that writing a reliable database is a very difficult task. Most of the NoSQL databases take the problem very seriously, and try to learn from those that have come before them. But why create a new type of database in the first place? Because an SQL database is not the right solution to every problem. When all you have is a schema, everything looks like a join. The data model in CouchDB (JSON documents) is a great fit for many web applications.
SQL Scales Just Fine
This is also true. If you’re picking a NoSQL database because it “scales”, you’re likely doing it wrong. Scaling is typically more aspiration than reality. There are many factors to consider and questions to ask when choosing a database technology besides “does it scale?” If you do actually have to scale, then your database isn’t going to magically do it for you. You can’t abstract scaling problems to your database layer. However, I will say that many NoSQL databases have properties (such as eventual consistency) that will make scaling easier and more intuitive. For example, it’s dead simple to replicate data between CouchDB databases.
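To give a feel for how little is involved, here is a minimal sketch of a replication request in Python. It assumes CouchDB's `/_replicate` HTTP endpoint with a JSON body containing `source` and `target`; the server URL and database names are placeholders I made up, and the function names are my own, not part of any CouchDB client library.

```python
import json
import urllib.request


def build_replication_request(server, source, target, continuous=False):
    """Build (but do not send) a POST to the server's /_replicate endpoint.

    `source` and `target` may be local database names or full database URLs.
    """
    body = {"source": source, "target": target}
    if continuous:
        # Ask CouchDB to keep replicating as new changes arrive.
        body["continuous"] = True
    return urllib.request.Request(
        url=server.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def replicate(server, source, target, continuous=False):
    """Send the replication request and return CouchDB's parsed JSON reply."""
    request = build_replication_request(server, source, target, continuous)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Against a running server, something like `replicate("http://localhost:5984", "blog", "http://backup.example.com:5984/blog")` would push the hypothetical `blog` database to a second machine.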
Atomicity, Consistency, Isolation, and Durability (ACID)
CouchDB is ACID compliant. Within a CouchDB server, for a single document update, CouchDB has the properties of atomicity, consistency, isolation, and durability (ACID). No, you can’t have transactions across document boundaries. No, you can’t have transactions across multiple servers (although BigCouch does have quorum reads and writes). Not all NoSQL databases are durable (at least with default settings).
If you want the best possible guarantee of durability, you can change CouchDB’s delayed_commits configuration option from true (the default) to false. Basically, this will cause CouchDB to do an explicit fsync after each operation (which is very expensive and slow). Note that operating systems, virtual machines, and hard drives often lie about fsync, so you really need to research more about how your particular system works if you’re concerned about durability. If you think your write speeds are too good to be true, they probably are.
If you leave delayed commits on, you have the option of setting a batch=ok parameter when creating or updating a document. CouchDB will then queue up batches of documents in memory and write them to disk when a predetermined threshold has been reached (or when triggered by the user). In this case, CouchDB will respond with an HTTP response code of 202 Accepted, rather than the normal 201 Created, so that the client is informed about the reduced integrity guarantee.
At least one NoSQL database requires a consistency check after a crash (guess which one). This can be a very slow process, causing additional downtime. CouchDB’s crash-only design and append-only files mean that there is no need for consistency checks. There’s no shutdown process in CouchDB: shutting it down is the same as killing the process.
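As a rough illustration of the batch=ok behavior described above, the sketch below builds a document write with or without the parameter and interprets the two status codes. The database URL, document ID, and helper names are hypothetical; only the `batch=ok` parameter and the 201/202 codes come from the text above.

```python
import json
import urllib.parse
import urllib.request


def build_save_request(db_url, doc_id, doc, batch=False):
    """Build (but do not send) a PUT that creates or updates one document."""
    url = db_url.rstrip("/") + "/" + urllib.parse.quote(doc_id, safe="")
    if batch:
        # Trade durability for throughput: CouchDB may hold the write in memory.
        url += "?batch=ok"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def describe_save_status(status):
    """Translate the HTTP status of a write into its durability meaning."""
    if status == 201:
        return "created: the write was handled normally"
    if status == 202:
        return "accepted: the write is queued in memory and not yet on disk"
    return "unexpected status: %d" % status
```

A client that cares about the difference would check the response status after each write and treat 202 as "probably saved" rather than "saved".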
CouchDB’s append-only files do come at a cost. That cost is disk space and the need for compaction. If you don’t compact your database, it will eventually fill up your hard drive. There is no automatic compaction in CouchDB. Compaction is triggered manually (it can easily be automated through a cron job) and should be done when the database’s write load is not at full capacity.
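A cron-friendly compaction script could look roughly like this. It is a sketch that assumes CouchDB's `POST /{db}/_compact` endpoint (which expects a JSON content type and normally requires admin rights); the server URL and database name are placeholders, and authentication is left out entirely.

```python
import json
import urllib.parse
import urllib.request


def build_compaction_request(server, db_name):
    """Build (but do not send) the POST that starts compaction of one database."""
    url = "%s/%s/_compact" % (server.rstrip("/"), urllib.parse.quote(db_name, safe=""))
    return urllib.request.Request(
        url,
        data=b"",  # no body is needed, but the content type header is
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def compact(server, db_name):
    """Trigger compaction and return CouchDB's parsed JSON reply."""
    with urllib.request.urlopen(build_compaction_request(server, db_name)) as response:
        return json.load(response)
```

A small wrapper script that calls `compact("http://localhost:5984", "blog")` could then be scheduled from cron for a quiet hour, in line with the advice above to avoid compacting under full write load.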
No Ad Hoc Queries
This is a feature, not a bug. CouchDB only lets you query against indexes. This means that queries in CouchDB will be extremely fast, even on huge data sets. Most web applications have predefined usage patterns and don’t need ad hoc queries. If you need ad hoc queries, say for business intelligence reporting, you can replicate your data (using CouchDB’s changes feed) to an SQL database.
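Here is one way the changes-feed-to-SQL idea might be sketched, using SQLite so the example stays self-contained. It assumes you have already fetched and parsed the JSON from the changes feed with `include_docs=true`; the table layout and the `author` and `title` fields are made up for illustration and would need to match your own documents.

```python
import json
import sqlite3


def ensure_schema(conn):
    """Create the (hypothetical) reporting table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "id TEXT PRIMARY KEY, rev TEXT NOT NULL, author TEXT, title TEXT, body TEXT NOT NULL)"
    )


def apply_changes(conn, feed):
    """Mirror one parsed changes-feed response into SQL.

    Returns the feed's last_seq so the next poll can resume from there.
    """
    for change in feed["results"]:
        if change.get("deleted"):
            conn.execute("DELETE FROM docs WHERE id = ?", (change["id"],))
            continue
        doc = change["doc"]
        conn.execute(
            "INSERT OR REPLACE INTO docs (id, rev, author, title, body) VALUES (?, ?, ?, ?, ?)",
            (doc["_id"], doc["_rev"], doc.get("author"), doc.get("title"), json.dumps(doc)),
        )
    conn.commit()
    return feed["last_seq"]
```

Once the rows are in SQL, ad hoc questions such as `SELECT author, COUNT(*) FROM docs GROUP BY author` become ordinary queries, while CouchDB keeps serving the predefined, indexed ones.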
Building Indexes Is Slow
If you have a large number of documents in CouchDB, the first build of an index will be very slow. However, each query after that will be very fast. CouchDB’s MapReduce is incremental, meaning new or updated documents can be processed without needing to rebuild the entire index. In most scenarios, this means that there will be a small performance hit to process documents that are new or updated since the last time the view was queried. You can optionally include the stale=ok parameter with your query. This will instruct CouchDB to not bother processing new or updated documents and just give you a stale result set (which will be faster than processing new or updated documents). As of CouchDB 1.1, you can include a stale=update_after parameter with your query. This will return a stale result set, but will trigger an update of the index (if necessary) after your query results are returned, bringing the index up-to-date for future queries by you or other clients.
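To make the stale options concrete, here is a small sketch that builds view query URLs. The `/_design/{name}/_view/{view}` path and the two `stale` values are the ones described above; the database URL, design document name, and view name are placeholders, and `view_url` is my own helper, not a CouchDB API.

```python
import urllib.parse

ALLOWED_STALE = (None, "ok", "update_after")


def view_url(db_url, design_doc, view, stale=None):
    """Return the URL for querying a view, optionally accepting stale results.

    stale=None           -> index is brought up to date before results return
    stale="ok"           -> return whatever the index holds right now
    stale="update_after" -> return stale results, then refresh the index
    """
    if stale not in ALLOWED_STALE:
        raise ValueError("stale must be None, 'ok', or 'update_after'")
    url = "%s/_design/%s/_view/%s" % (
        db_url.rstrip("/"),
        urllib.parse.quote(design_doc, safe=""),
        urllib.parse.quote(view, safe=""),
    )
    if stale is not None:
        url += "?" + urllib.parse.urlencode({"stale": stale})
    return url
```

For example, `view_url("http://localhost:5984/blog", "posts", "by_date", stale="update_after")` yields a URL that favors response time now and freshness for the next caller.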
No Schema
Some say that not having a schema is a problem. Sure, if you have structured data, you probably want to enforce a schema. However, not all applications have highly structured data. Many web applications work with unstructured data. If you’ve encountered any of the following, you may want to consider a schema-free database:
- You’ve found yourself denormalizing your database to optimize read performance.
- You have rows with lots of NULL values because many columns only apply to a subset of your rows.
- You find yourself using SQL antipatterns such as entity-attribute-value (EAV), but can’t find any good alternatives that fit with both your domain and SQL.
- You’re experiencing problems related to the object-relational impedance mismatch. This is typically associated with use of an object-relational mapper (ORM), but can happen when using other data access patterns as well.
I’ll add that you can enforce schemas in CouchDB through the use of document update validation functions.
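For the curious, a validation function lives in a design document under the `validate_doc_update` key and is written in JavaScript. The sketch below assembles such a design document in Python, ready to be saved like any other document. The rule itself (posts must have a string title), the `type` field, and the design document name are invented for illustration.

```python
import json

# JavaScript source that CouchDB runs on every write to the database.
# Throwing {forbidden: ...} rejects the update.
VALIDATE_POST = """
function (newDoc, oldDoc, userCtx) {
  if (newDoc._deleted) {
    return; // always allow deletions
  }
  if (newDoc.type === "post" && typeof newDoc.title !== "string") {
    throw({forbidden: "Posts must have a string title."});
  }
}
""".strip()


def build_validation_design_doc(name="validation"):
    """Return a design document (as a dict) carrying the validation function."""
    return {
        "_id": "_design/" + name,
        "validate_doc_update": VALIDATE_POST,
    }


def to_json(design_doc):
    """Serialize the design document for use as an HTTP request body."""
    return json.dumps(design_doc)
```

Saving the result of `build_validation_design_doc()` into a database means later writes that break the rule are rejected, which gives you schema enforcement exactly where you decide you want it.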
Did I miss anything? What other criticisms exist of NoSQL databases? Please comment and I’ll do my best to address each.