Databases are a lot like programming languages: there’s a great variety to choose from, they have wildly different programming and query interfaces, and there’s a wide gap in available features. And just as choosing the right programming language can have a significant impact on a project’s success, so can choosing the right database. So what really matters when choosing a database?
Absolute Raw Speed Does Not Matter
Surprisingly, the raw speed of the database does not matter in 99% of cases. From an application’s perspective, database queries tend to have a bimodal distribution: they’re either good enough or too slow to complete. It’s usually more important to maintain good-enough performance while scaling concurrent access and dataset sizes than to squeeze the best possible performance out of a narrow set of queries.
For example, look at Google’s F1 distributed database. Commit latency on F1 is quite high (50-100ms), and read latency takes a hit as well, with simple reads in the 5-10ms range. Compared with a single-instance MySQL, those numbers look poor. However, Google’s core ad business runs on F1, at a scale of 10s of TBs across 1000s of machines. The ability to maintain good-enough performance by scaling the system matters more.
Expressiveness of the Query Language Matters
Chocolate or vanilla? Beer or wine? SQL or NoSQL? All questions that inspire endless debates. However, most debates about database query languages tend to miss the mark. It’s possible to have a weak SQL implementation (e.g. lack of subqueries, limited aggregate functionality, a poor planner) and a relatively feature-rich non-SQL implementation (RethinkDB supports join operations between documents, while MongoDB does not).
Consider some of the following criteria when evaluating a database query language:
- How many network trips do I need to make to the database for typical query patterns?
- For operations that require looking at a large number of records, does the database support the necessary aggregate/window/analytics functions without shifting the burden onto the application? Does it support concepts such as temporary tables when more complex, multi-stage evaluations are necessary?
- Can I leverage existing queries/work without having to duplicate queries (e.g. SQL views)?
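The last two points can be illustrated with a minimal sketch using Python’s built-in `sqlite3` module; the `orders` schema and values here are hypothetical, chosen only to show an aggregate computed inside the database and reused via a view:

```python
import sqlite3

# Hypothetical schema for illustration: orders with per-customer amounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 10.0), ('alice', 15.0), ('bob', 7.5);
    -- A view captures the aggregation once, so later queries can reuse it
    -- instead of duplicating the GROUP BY logic.
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total, COUNT(*) AS n_orders
        FROM orders
        GROUP BY customer;
""")

# One round trip returns the aggregated result; the raw order rows
# never have to be transferred to the application for summing.
rows = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY total DESC"
).fetchall()
print(rows)  # [('alice', 25.0), ('bob', 7.5)]
```

With a query language that lacks server-side aggregates or views, the application would have to fetch every row and maintain that summation logic itself, in every place it is needed.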
While there’s nothing inherent to SQL that makes it a better fit, it turns out that most database implementations that rely on it offer a much richer and more powerful interface than those that don’t. Perhaps in a few years some of the alternatives will catch up, but we are not there yet.
And finally, a subject of many flame wars: transactions. It’s true that many folks have built large-scale systems that do not rely on the database’s support for transactions (or lack thereof). However, such systems typically move the well-understood properties that database transactions offer into an ad-hoc implementation within the application itself, which generally leads to a more complex and fragile codebase.
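A small sketch of what the database handles for you, again using the built-in `sqlite3` module (the account names and amounts are made up): an attempted transfer fails partway through, and the transaction rollback undoes the earlier update automatically, with no compensating cleanup code in the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

# Try to move 150 from alice to bob. The overdraft check raises
# mid-transaction; the rollback leaves BOTH balances untouched.
try:
    with conn:  # sqlite3 connection as context manager: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        (balance,) = conn.execute("SELECT balance FROM accounts "
                                  "WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
except ValueError:
    pass  # the database already rolled the partial update back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Without transactions, the application would need its own logic to detect and reverse the half-applied debit, which is exactly the kind of ad-hoc reimplementation described above.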
The Google Spanner designers said the following about transactions:
“We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.”