Polyglot Persistence in Practice
Let's look at how modern applications store data in many different formats and locations to serve different needs.
It used to be the case that your database choice was limited to which relational database vendor you used (and often, that was settled ahead of time). In the past fifteen years, a lot has changed, including the baseline assumption that you’ll have a single database for your application and it will be a relational one.
I’m not about to rehash the SQL vs. NoSQL vs. NewSQL debate in this article. Instead, I want to focus on practical differences between the various approaches and some of the ways that you can use the notion of polyglot persistence to reduce the time, effort, and pain it takes to build modern applications.
Let’s talk about the most classic of examples: the online shop. We have the customer, orders, and shipping as our main concerns. It used to be that you would start by defining a shared database schema among all these components. That led to many interesting issues. The way the Customers module works with a customer record and the way the Shipping module works with it are completely different.
It isn’t just a matter of different queries and the need to balance different indexes for each of the distinct parts of the system. If a customer changes her name, you don’t want to refer to her by the old name. On the other hand, a package that was sent to that customer before the name change still needs to record the name it was sent under.
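The name-change problem can be made concrete with a minimal sketch. The `Customer` and `Shipment` classes and the `create_shipment` helper are hypothetical names chosen for illustration, not part of any particular system. The idea is that the shipping record copies the name at ship time instead of pointing at the live customer record.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Hypothetical record owned by the Customers module; holds the current name."""
    customer_id: int
    name: str


@dataclass(frozen=True)
class Shipment:
    """Hypothetical record owned by the Shipping module; frozen so it cannot change."""
    shipment_id: int
    customer_id: int
    recipient_name: str  # copied at ship time, not looked up later


def create_shipment(shipment_id: int, customer: Customer) -> Shipment:
    # Snapshot the name as it is right now instead of storing a live reference.
    return Shipment(shipment_id, customer.customer_id, customer.name)


customer = Customer(customer_id=1, name="Jane Doe")
shipment = create_shipment(100, customer)

customer.name = "Jane Smith"  # the customer changes her name later

print(customer.name)            # the Customers module sees the new name
print(shipment.recipient_name)  # the shipment still records the name it was sent under
```

Each module keeps the shape of the data that suits its own job, which is hard to do when both share a single customer table.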
If you want to implement a recommendation system for the customers, you’ll quickly realize that data structured for the quarterly reports is going to make your life… unpleasant. In short, a single database and a single schema to rule them all isn’t going to work (except in Mordor).
The obvious answer is to split things up: to use different schemas for different pieces of the system and allow them to build their own data models independently. A natural extension of that is to use the best database type for each piece’s needs. In this case, I’m not talking about selecting MySQL vs. Oracle but relational vs. document vs. graph vs. column store.
The biggest challenge in such a system, where you have multiple distinct data silos that operate in tandem, is how you actually manage, understand, track, and work with all the data. This is of particular importance for organizations that need to deal with regulations such as GDPR or HIPAA. Note that this challenge isn’t caused by using multiple database storage types. It arises because the complexity of our systems is so high that we have to break them apart into individual components to be able to make sense of them.
In the old days, data interchange used to be done with flat files. If you were lucky, they were self-describing (CSV was an improvement over the status quo, if you can believe that). In many cases, they were fixed-length, arbitrary formats with vague specifications. Today, we have much better options available to us. We can utilize dedicated ETL tools to move data from one system to another, transforming it along the way.
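To illustrate what "moving data while transforming it" means, here is a toy extract-transform-load pass in plain Python. It is only a sketch: the column names, the sample rows, and the `extract`, `transform`, and `load` functions are made up for this example, and a dedicated ETL tool would add scheduling, error handling, and much more.

```python
import csv
import io


def extract(csv_text: str) -> list[dict]:
    # Read self-describing CSV rows from the source system's export.
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(row: dict) -> dict:
    # Reshape one row into the (hypothetical) format the target system expects.
    return {
        "order_id": int(row["id"]),
        "customer": row["customer"].strip().title(),
        "total_cents": round(float(row["total"]) * 100),
    }


def load(rows, target: list) -> None:
    # Stand-in for writing to the target database.
    target.extend(rows)


# Hypothetical export from the source system.
source = "id,customer,total\n1, jane doe ,19.99\n2,JOHN ROE,5.00\n"

warehouse: list[dict] = []
load((transform(row) for row in extract(source)), warehouse)

for record in warehouse:
    print(record)
```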
One thing that hasn’t changed in all that time is the absolute requirement of understanding the flow of data in your organization. The recommended way to do that is to define an owner for each piece of data. The Orders service is the owner of all the orders’ data. It is the definitive source for all data about orders. If the recommendation engine needs to go through the past orders, it can only do that through published interfaces that are owned by the Orders service team.
A published interface doesn’t have to be a SOAP/REST API. It can be a set of files published to a particular directory, an API endpoint where you can register to get callbacks when data changes, or an internally accessible database. What matters is that it is meant for external consumption and is distinct from whatever format the data is kept in internally.
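One of the options above, registering for callbacks when data changes, can be sketched in a few lines. Everything here is an assumption made for illustration: the `OrdersService` and `RecommendationEngine` classes, the event shape, and the in-process callback list stand in for whatever transport a real system would use.

```python
from typing import Callable

OrderEvent = dict
Subscriber = Callable[[OrderEvent], None]


class OrdersService:
    """Hypothetical owner of order data; the only component that writes it."""

    def __init__(self) -> None:
        self._orders: dict[int, dict] = {}  # internal format, never exposed
        self._subscribers: list[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        # Published interface: other components register here for changes.
        self._subscribers.append(callback)

    def place_order(self, order_id: int, customer_id: int, items: list[str]) -> None:
        # Internal representation can change freely without breaking consumers.
        self._orders[order_id] = {"cust": customer_id, "items": list(items)}
        # Published representation is a separate, stable shape.
        event = {
            "type": "order_placed",
            "order_id": order_id,
            "customer_id": customer_id,
            "items": list(items),
        }
        for callback in self._subscribers:
            callback(event)


class RecommendationEngine:
    """Hypothetical consumer; builds its own model from published events."""

    def __init__(self) -> None:
        self.purchases: dict[int, set[str]] = {}

    def on_order_event(self, event: OrderEvent) -> None:
        if event["type"] == "order_placed":
            self.purchases.setdefault(event["customer_id"], set()).update(event["items"])


orders = OrdersService()
engine = RecommendationEngine()
orders.subscribe(engine.on_order_event)
orders.place_order(order_id=1, customer_id=7, items=["book", "lamp"])
print(engine.purchases)
```

The recommendation engine never reads the Orders service’s internal storage, so the Orders team can change that storage without coordinating with anyone.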
Fifteen years ago, this was called service-oriented architecture; recently, it has become common to call it microservices. Regardless of what it’s actually called, the idea of separation of concerns and ownership of data is critical for successfully separating the different parts of the system into individual components.
Once clear boundaries have been established, it is much easier to start working on each of the different pieces independently. The Orders service can use a document database, while the recommendation engine will likely use a graph. Analytics and reporting are handled by a column store, and the financial details are recorded in a relational database.
Previously, when I talked about published interfaces between the components, I was careful to not use the term API. Indeed, in most cases, I would recommend against using one. When we need a different part of the system to perform something for us, by all means, call it. But when we need a piece of data, it is usually better to have it already available than to have to seek it out.
Consider a simple case where shipping needs to know whether a customer has a preferred status so it can apply a different shipping charge to their orders. One way of doing this is to call the Customers service for each of the orders that shipping goes through and ask it about the customer’s status. This is a gross violation of service independence. It means that shipping cannot operate if the Customers service is not up and running.
Ideally, we want to have better isolation between them. A better approach is to store the relevant data locally in the Shipping service. This means that a customer’s preferred status may take a while longer to update in shipping’s database. At worst, that may mean that a customer will not get the preferred rate. That is something that can easily be fixed with a refund and is likely to be rare, anyway. The benefit of separating the systems, on the other hand, is an ongoing effect.
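A rough sketch of that local-copy approach might look like the following. The `ShippingService` class, its method names, and the two rates are hypothetical. The point is only that the charge is computed from data shipping already holds, so no call to the Customers service happens at lookup time.

```python
class ShippingService:
    """Hypothetical Shipping service that keeps its own copy of preferred status."""

    STANDARD_RATE_CENTS = 500   # made-up rates for illustration
    PREFERRED_RATE_CENTS = 0

    def __init__(self) -> None:
        self._preferred_customers: set[int] = set()  # local, possibly stale copy

    def on_customer_status_changed(self, customer_id: int, preferred: bool) -> None:
        # Called whenever the Customers service publishes a status change.
        if preferred:
            self._preferred_customers.add(customer_id)
        else:
            self._preferred_customers.discard(customer_id)

    def shipping_charge_cents(self, customer_id: int) -> int:
        # Purely local lookup: works even if the Customers service is down.
        if customer_id in self._preferred_customers:
            return self.PREFERRED_RATE_CENTS
        return self.STANDARD_RATE_CENTS


shipping = ShippingService()

# The status update has not arrived yet, so the customer is charged the standard rate.
before = shipping.shipping_charge_cents(42)

# The update arrives later; subsequent orders get the preferred rate.
shipping.on_customer_status_changed(42, preferred=True)
after = shipping.shipping_charge_cents(42)

print(before, after)
```

The `before` value is the stale-data case described above: a missed preferred rate that can be corrected afterwards with a refund.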
Beyond abstract ideals of architecture, there is also the overall system robustness and performance. By having the data for common operations in-house, so to speak, we can save a lot of time and effort for the entire system. It is easy to forget during development, but out-of-process (and out-of-machine) queries and calls carry a non-trivial cost that really adds up. I’ll point you toward the fallacies of distributed computing for more details on how common this issue is, and how fatal it can be for your overall system health and performance.
You might have noticed that beyond peeking a bit at the different database types, I haven’t actually dealt with the interactions of the different data models. That is because, for the most part, this is an implementation detail at the overall system architecture level. What is important is the notion of data ownership and data flow: identifying who gets to modify a piece of data (only its owner) and how it is distributed in the system.
Once that problem is solved — and it is anything but a simple one — the actual decision of putting the data in a table or a document model is a matter of taste, the specifics of the problem at hand, and (most importantly) a reversible decision that has a local, not global, impact.