Today’s World Calls for a New Kind of Database
Considering a converged approach to datastores and databases to reduce data sprawl and complexity, and the role of purpose-based databases for application development.
Over the past decade, applications have become more and more data-intensive. Dynamic data, analytics, and models are now at the core of any application that matters. To support these requirements, there is a commonly held, but often incorrect, belief that modern applications need to be built on top of a variety of special-purpose databases, each built for a specific workload. It is said that this allows you to pick the best ones to solve your application needs.
This trend is apparent when you look at the plethora of open-source data tools that have proliferated in recent years. Each one was built to scratch an itch, optimized for specific, narrow use cases seen in a smattering of projects. In response, some of the cloud vendors have packaged up these multiple database technologies for you to choose from, commonly forking from existing open-source projects. You’re then meant to wire together several of these tools into the needed data solution for each application.
On the surface, the argument seems logical. Why bother building or using general-purpose technology across multiple problem domains, maybe having to work around limitations that come from being a general-purpose solution, when you can use multiple tools purpose-built for each of the specific problems you are trying to solve?
The big cloud vendors are clearly all in on this approach, as they offer multiple types of databases. (Among the fifteen database services offered by AWS, there are eight operational databases, and that is not even including the analytics data warehouse systems like Redshift and Athena.)
This is not to say that there is no value in each of the special-purpose database offerings. There are certainly use cases where those special-purpose databases shine and are truly the best choice for the use case. These are cases for which the requirements in one specific dimension are so extreme that you need something special-purpose to meet them.
But the absolute requirement for specialty databases is mostly in outlier use cases, which are a tiny fraction of the total of workloads out there. In the vast majority of apps that people build, the requirements are such that they can be satisfied by a single, operational NewSQL database – a distributed relational database, supporting a mix of transactional and analytical workloads, multi-model, etc. This is especially true when you find you need more than one special-purpose database in your solution or when your requirements are expected to change over time.
The burden of choice has always been the dilemma of the software engineer. It used to be that the choice was whether to buy an existing component or to build it yourself. You made the trade-off between the dollar cost to purchase the component – and the risk it might not be as good as you hoped – vs. the cost, in time and engineering resources, to build and maintain a custom solution.
Most experienced engineers would likely agree that it is generally better to buy an existing component if it can meet the requirements. The cost to build is always higher than you think, and the cost to work out issues and maintain the solution over time often dwarfs the initial cost. In addition, having someone to call when something breaks is critical for a production system with real customers.
But then things changed.
Choices Have Ballooned With Open Source and the Cloud
The emergence of open-source software has fundamentally changed the build vs. buy choice. Now, it is a choice of build, buy – or get for free. And people love free.
Most engineers who use open source don’t really care about tinkering with the source code and submitting their changes back to the code base or referring to the source code to debug problems. While that certainly does happen (and kudos to those who contribute), the vast majority are attracted to open source because it is free.
The availability of the internet and modern code repositories like GitHub has made the cost to build software low and the cost to distribute it virtually nothing. This has given rise to new technology components at a faster rate than ever seen before. GitHub has seen massive growth in the number of new projects and the number of developers contributing, with 40 million contributors in 2019, 25% of whom were new, and 44 million repositories.
On the face of it, this seems great. The more components that exist, the better the odds that the one component that exactly matches my requirements has already been built. And since they are all free, I can choose the best one. But this gives rise to a new problem. How do I find the right one(s) for my app?
Too Many Options Make Selection Challenging
There are so many projects going on that navigating the tangle is pretty difficult. In the past, you generally had a few commercial options. Now, there might be tens or hundreds of options to choose from. You end up having to narrow it down to a few choices based on limited time and information.
Database technology, in particular, has seen this problem mushroom in recent years. It used to be you had a small number of choices: Oracle, Microsoft SQL Server, and IBM DB2 as the proprietary choices, or MySQL if you wanted a free and open-source choice.
Then, two trends matured: NoSQL and the rise of open source as a model. The number of choices grew tremendously. In addition, as cloud vendors are trying to differentiate, they have each added both NoSQL databases and their own flavors of relational (or SQL) databases. AWS has more than 10 database offerings; Azure and GCP each have more than five flavors.
DB-Engines (a site for tracking the popularity of database engines) has more than 300 databases on the list, with new ones getting added all the time. Even the definition of what a database is has evolved, with some simple data tools such as caches marketing themselves as databases. This makes it difficult to know, without a lot of research, whether a particular technology will match the requirements of your application. Fail to do enough research, and you can waste a lot of time building on a data technology, only to find it has some important gap that tanks your design.
What to Consider in Choosing a Database
There are many different flavors of databases on the list. Operational databases and data warehouses are the most common types, but there are several more. Each type addresses its own set of requirements.
Database Type | Examples
Operational Databases | Oracle, SQL Server, Postgres, MySQL, MariaDB, AWS Aurora, GCP Spanner, SingleStore
Data Warehouses | Teradata, Netezza, Vertica, Snowflake, SingleStore
Key-Value Stores | Redis, GridGain, Memcached
Document Stores | MongoDB, AWS DocDB, AWS DynamoDB, Azure Cosmos DB, CouchDB
Full-Text Search Engines | Elasticsearch, AWS Elasticsearch Service, Solr
Time Series | InfluxDB, OpenTSDB, TimescaleDB, AWS Timestream
GraphDB | Neo4j, JanusGraph, TigerGraph, AWS Neptune
Table 1. Fitting Your Prospective Application to Different Database Types
Every database excels in slightly different scenarios. And there are new specialty databases emerging all the time.
If you’re building a new solution, you have to decide what data architecture you need. Even if you assume the requirements are clear and fixed – which is almost never the case – navigating the bewildering set of choices regarding which database to use is pretty hard. You need to assess requirements across a broad set of dimensions, such as functionality, performance, security, and support options, to determine which ones meet your needs.
If you have functionality that cuts across the different specialty databases, then you will likely need multiple of them. For example, you may want to store data using a standard relational model but also need to do full-text queries. You may also have data whose schema is changing relatively often, so you want to use a JSON document as part of your storage.
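As a rough illustration of that scenario, the sketch below keeps relational columns, a JSON document, and a full-text index in one SQL engine and filters on all three in a single query. It uses Python's built-in sqlite3 module purely as a convenient stand-in (it is not a distributed NewSQL system), and it assumes the local SQLite build includes the JSON functions and the FTS5 extension. The products schema, the sample rows, and the find_products helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    price REAL NOT NULL,
    attrs TEXT NOT NULL  -- JSON document for fields whose schema changes often
);
-- Full-text index over product descriptions (requires FTS5).
CREATE VIRTUAL TABLE product_text USING fts5(description);
""")

# Hypothetical sample data: (id, name, price, JSON attributes, description).
rows = [
    (1, "Trail Shoe", 89.0, '{"color": "red", "sizes": [41, 42]}',
     "Lightweight shoe for rocky mountain trails"),
    (2, "Road Shoe", 120.0, '{"color": "blue", "drop_mm": 8}',
     "Cushioned shoe for long road marathons"),
    (3, "Rain Jacket", 150.0, '{"color": "red", "waterproof": true}',
     "Packable jacket for wet mountain weather"),
]
for pid, name, price, attrs, description in rows:
    conn.execute("INSERT INTO products VALUES (?, ?, ?, ?)",
                 (pid, name, price, attrs))
    conn.execute("INSERT INTO product_text(rowid, description) VALUES (?, ?)",
                 (pid, description))


def find_products(conn, term, color, max_price):
    """Combine a full-text match, a JSON field filter, and a relational filter."""
    query = """
        SELECT products.name
        FROM products
        JOIN product_text ON product_text.rowid = products.id
        WHERE product_text MATCH ?
          AND json_extract(products.attrs, '$.color') = ?
          AND products.price < ?
        ORDER BY products.id
    """
    return [row[0] for row in conn.execute(query, (term, color, max_price))]


print(find_products(conn, "mountain", "red", 100.0))  # ['Trail Shoe']
```

With separate specialty systems, the same lookup would typically mean querying a search engine, a document store, and a relational database, then joining the results in application code.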
The combination of databases you can use in your solution is pretty large. It’s hard to narrow that down by just scanning the marketing pages and the documentation for each potential solution. Websites cannot reliably tell you whether a database offering can meet your performance needs. Only prior experience or a proof of concept can do that effectively.
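A proof of concept does not have to be elaborate. The sketch below shows one possible shape for a small latency harness: it runs a representative query repeatedly and reports rough percentiles. Everything here is an assumption made for illustration: the events table, the top_spenders query, and the use of an in-memory SQLite database as a stand-in for whichever candidate system you are evaluating. A real test would run against the actual candidate with realistic data volumes and concurrency.

```python
import sqlite3
import statistics
import time


def measure_latency_ms(run_query, repetitions=200):
    """Time repeated calls to run_query and summarize latency in milliseconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Simple index-based approximation of the 95th percentile.
    p95_index = int(0.95 * (len(samples) - 1))
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


# Hypothetical workload: 10,000 synthetic rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO events (user_id, amount) VALUES (?, ?)",
    [(i % 100, i * 0.5) for i in range(10_000)],
)


def top_spenders():
    """Representative analytical query for the proof of concept."""
    return conn.execute(
        "SELECT user_id, SUM(amount) AS total FROM events "
        "GROUP BY user_id ORDER BY total DESC LIMIT 5"
    ).fetchall()


stats = measure_latency_ms(top_spenders)
print(stats)
```

Swapping the connection and query for each candidate gives comparable numbers, which marketing pages and documentation cannot provide.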
How Do I Find the Right People?
Once you have found the right set of technologies, who builds the application? You likely have a development team already, but the odds of them being proficient in programming applications on each specific, new database are low.
This means a slower pace of development as they ramp up. Their work is also likely to be buggier as they learn how to use the system effectively. They also aren’t likely to know how to tune for optimal performance. This affects not just developers but the admins who run, configure and troubleshoot the system once it is in production.
How Do I Manage and Support the Solution With So Many Technologies?
Even after you pick the system and find the right people, running the solution is not easy. Most likely, you had to pick several technologies to build the overall solution. That probably means that no one in your organization understands all the parts.
Having multiple parts also means you have to figure out how to integrate all the pieces together. Those integration points are both the hardest to figure out and the weakest points in the system. They are often where performance bottlenecks accumulate. They are also sources of bugs and brittleness, as the pieces are most likely not designed to work together.
When the solution does break, problems are hard to debug. Even if you have paid for support for each technology, which defeats the purpose if you’re using things that are free, the support folks for each technology are not likely to help figure out the integration problems. (They are just as likely to blame each other as to help you solve your problem.)
The Takeaway
Going with multiple specialty databases is going to cost you in time, hassle, money, and complexity:
Investigation and analysis. It takes a lot of energy and time to interrogate a new technology to see what it can do. The number of choices available is bewildering and overwhelming. Every minute you spend doing the investigation slows your time to market.
Many vendors. If you end up choosing multiple technologies, you are likely to have different vendors to work with. If the solution is open source, you are either buying support from a vendor or figuring out how to support the solution yourself.
Specialized engineers. It takes time and experience to truly learn how to use each new data technology. The more technology you incorporate into your solution, the harder it is to find the right talent to implement it correctly.
Complicated integrations. The most brittle parts of an application are the seams between two different technologies. Things break down where data is transferred between systems with slightly different semantics, differing protocols, and connection technologies with different scale points (usually when the system is at its busiest).
Performance bottlenecks. Meshing two different technologies is also where performance bottlenecks typically occur. With data technologies, it is often because of data movement.
Troubleshooting integration problems. Tracking down and fixing these issues is problematic, as the people doing the tracking down are rarely experts in all the technologies. This leads to low availability, frustrated engineers, and unhappy customers.
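To make the "complicated integrations" item concrete, here is a minimal sketch of how a value can change meaning at a seam. It assumes a hypothetical pipeline in which one system keeps monetary amounts as exact decimals and hands them to another system as JSON numbers, which the receiver parses as binary floating-point values. The function names are illustrative and do not come from any particular product.

```python
import json
from decimal import Decimal


def export_as_json_numbers(amounts):
    """Hypothetical export step: serialize exact decimals as JSON numbers."""
    return json.dumps([float(a) for a in amounts])


def import_and_sum_floats(payload):
    """Hypothetical import step: the receiver parses JSON numbers as floats."""
    total = 0.0
    for value in json.loads(payload):
        total += value
    return total


def export_as_json_strings(amounts):
    """Alternative export: keep the decimal text so no precision is lost."""
    return json.dumps([str(a) for a in amounts])


def import_and_sum_decimals(payload):
    """Alternative import: rebuild exact decimals from the strings."""
    total = Decimal("0")
    for text in json.loads(payload):
        total += Decimal(text)
    return total


amounts = [Decimal("0.10"), Decimal("0.20")]

float_total = import_and_sum_floats(export_as_json_numbers(amounts))
exact_total = import_and_sum_decimals(export_as_json_strings(amounts))

print(float_total == 0.3)               # False: binary floats cannot hold 0.1 exactly
print(exact_total == Decimal("0.30"))   # True: decimal semantics preserved
```

Neither system is wrong on its own; the discrepancy only appears where the two are wired together, which is why such bugs tend to be hard to attribute to a single component.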
Ideally, there would be a familiar database infrastructure with an interface that most existing developers know how to use and optimize, and with the functionality needed to handle 90% or more of the use cases that exist. It would need to be cloud-native, meaning it runs natively in any cloud environment, as well as in an on-premises environment, using cloud-friendly platforms such as Kubernetes.
This ideal technology would also be distributed so that it scales easily, as required by the demands of the application. This database would be the default choice for the vast majority of applications, and developers would only need to look for other solutions if they hit an outlier use case. Using the same database technology for the vast majority of solutions means the engineers will be familiar with how to use it and able to avoid the issues listed above.
Legacy databases like Oracle and SQL Server served this function for a long time. But the scale and complexity requirements of modern applications outgrew their capabilities.
This gave rise to a plethora of NoSQL systems that emerged out of the need to solve the scale problem. (I discuss this in my blog about NoSQL and relational databases.) But the NoSQL systems gave up a lot of the most useful functionality, such as structure for data and SQL query support, forcing users to choose between scale and functionality.
NewSQL systems allow you to have the best of both worlds. You get a highly scalable, cloud-native system that is durable, available, secure, and resilient to failure, but with an interface familiar to developers and admins. In selecting a NewSQL system, look for a solution that supports a broad set of functionality. ANSI SQL is obviously a must, but the system should also support all the major data storage types: relational, semi-structured (e.g., JSON), geospatial, graph, and full text. It should natively ingest from the most common data sources (object stores like S3, queues like Kafka, as well as other relational databases). And it should natively process the standard data formats, such as JSON, Avro, and Parquet.
Conclusion
Some cloud providers claim that you need to have eight different purpose-built databases, if not more, to build your application and that it is impractical for one database to meet all of the requirements for an application.
I respectfully disagree. While there are some outlier use cases that may require a specialty database, the vast majority of applications can have all their key requirements satisfied by a single NewSQL database.
Even if you can build your solution using a number of special-purpose databases, the cost to investigate, build, optimize, implement, and manage those systems will outweigh any perceived benefit.
Keep it simple. Use a consolidated database that meets your requirements.
Published at DZone with permission of Rick Negrin, DZone MVB.