The Truth About Your “Source of Truth”
Ways to make your Source of Truth more accurate include having more than one source and ridding oneself of unwanted dependencies.
Too often when designing our distributed systems, we find ourselves wrapped around the axle, trying to define a single source of truth (or SoT) for our domain entities. Common wisdom has it that each domain entity in our enterprise should live in exactly one, centralized location. If we want to fetch an instance of that entity, we go to that location.
As with much common wisdom, this isn’t necessarily so. It’s good to know where our data resides, sure. But bending over backward to define a single SoT for every piece of data is the wrong approach. Rarely is it necessary.
Worse, it can cause more problems than it purports to solve. The very notion runs counter to the event-based systems that power many enterprises today. As we’ll discuss, a true source of truth for our data is, by and large, mythical.
What’s Wrong With a Single Source of Truth?
Before we get into the problems of single SoTs, let’s examine why we build microservices architectures in the first place. We’ll start with one of the most fundamental patterns related to microservices: the Bounded Context. Much has already been written about the pattern. The general idea is that each business domain should be thought of as its own self-contained system, with judiciously designed inputs and outputs.
From an organizational standpoint, each Bounded Context is owned by a single, cross-functional team. The team builds, deploys, and maintains the services that it needs to get its job done — with minimal dependencies on any other teams.
This leads us to the most fundamental benefits of a microservices architecture. While we can rattle off the technical advantages to using microservices, the true benefits are organizational. Teams gain control and responsibility for their projects.
They can make as many changes to their code as they want to, with little fear of breaking other teams. They can release as frequently as they need, without requiring coordination with any other team. They own the entire lifecycle of their code, from design and development to deployment, to production monitoring.
Enforcing a single source of truth for our domain entities sabotages those benefits.
Instead, single sources of truth re-establish dependencies across teams and bounded contexts. They force teams to break out of their bounded context to perform work. If a team requires a change to an entity, it is now at the mercy of some other team’s schedule to get the change made. Consider production issues, too. Teams may now easily find themselves awoken at night, paged because some other team’s service is misbehaving.
Single sources of truth also equate to single points of failure. Requiring all services to fetch User data from a single location means that our entire enterprise may grind to a halt when the User service has issues. SoTs also introduce performance and scalability bottlenecks. As the number of applications that need that User data grows, and as each of those applications itself scales, the load on the User service multiplies.
Perhaps most importantly, adherence to single sources of truth severely hampers our ability to move forward with a workable architecture. This becomes apparent as teams wrap themselves further and further around the axle, arguing about the SoT for such-and-such entity.
Have you ever found yourself pulling multiple different teams together to debate the design of your microservice’s API? Trying to meet every team’s requirements? Negotiating compromises when those requirements contradict each other? Yeah, me too. That’s a clear smell that we’ve been doing something wrong.
Why again did we bother with a microservices architecture?
We need to relax. Stop worrying about our “source of truth”. In all likelihood, the “single source of truth” we’re seeking doesn’t even exist.
The Quest for the Holy Source of Truth
At least, it probably doesn’t exist among our microservices. Let’s take a standard inventory management system containing a service that tracks physical inventory items, which in turn are stored in a physical warehouse.
Every time a new item is delivered to the warehouse, its barcode is scanned, and the service is updated to reflect the new amount. Thus, the inventory item service will be our reliable, single source of truth for our enterprise’s product inventory…right?
Except, how accurate will the database be after warehouse workers walk out with their pockets stuffed full of items at night? Or when a leaky pipe in a warehouse corner causes packages to slowly disintegrate? Or when a pallet of new items is simply missed by the scanner?
Let’s try another example. Say we've built a company that aggregates deals on hotel rooms. We partner with hotel chains, ingesting underbooked rooms from the chains, and presenting the best deals to interested would-be travelers. To do this, we have a HotelPartners bounded context that ingests room data from our partner hotels and stores it in a database.
So… is that database an SoT for that data? Not really. We sourced that data from our partners’ databases, which in turn are representations of the availability of their physical hotel rooms. Any “source of truth” for this data certainly doesn’t reside within our organization.
The point is, obsessing over the “one true source of truth” for our entities can be a fool’s errand. Often, there isn’t any such thing—at least, not in our enterprise system.
Instead of searching for the elusive, absolute, single source of truth for anything, we should embrace the fact that “sources of truth” are relative.
Recognizing this fact is remarkably liberating.
Think in Terms of Canonical Views and Scopes
Rather than a single SoT for our data, we can instead think in terms of the canonical view of data within a given scope. Within the scope of any system that stores data, there will be a data store that represents the system’s most up-to-date view of that data. That is the system’s canonical view.
There may be additional data stores in the system that also provide views of that data. Maybe the data is cached for quicker read access, or enhanced with data from another canonical source. But those additional data stores are always subservient to the scope’s canonical data source.
As an analogy, think of materialized views in a database system. The original table(s) from which the materialized views are derived represent the canonical view of the data, within the scope of the database schema.
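To make the relationship concrete, here is a minimal in-memory sketch of a canonical store and one derived view. The names (CanonicalRoomStore, PriceCache, the record fields) are invented for illustration; they are assumptions, not part of any real system described here.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalRoomStore:
    """Hypothetical canonical store for one scope; names are illustrative."""
    rooms: dict = field(default_factory=dict)   # room_id -> record
    views: list = field(default_factory=list)   # derived stores to refresh

    def save(self, room_id, record):
        # Write to the canonical store first, then refresh every derived view.
        self.rooms[room_id] = dict(record)
        for view in self.views:
            view.refresh(room_id, self.rooms[room_id])


class PriceCache:
    """A derived, read-optimized view: it only changes via refresh()."""
    def __init__(self):
        self.price_by_room = {}

    def refresh(self, room_id, record):
        self.price_by_room[room_id] = record["price"]


store = CanonicalRoomStore()
cache = PriceCache()
store.views.append(cache)
store.save("room-1", {"hotel": "Example Inn", "price": 120})
store.save("room-1", {"hotel": "Example Inn", "price": 95})
```

The cache only ever changes as a consequence of a write to the canonical store, which is what makes it subservient within this scope.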
Organizational and Industry Scopes
Let’s revisit the hotel room aggregator from a few paragraphs back. As engineers in this semi-fictitious organization, we fetch information about hotel rooms from our third-party partner sources. The data enters our system and is transformed and stored, all within our HotelPartners bounded context. Other product-focused bounded contexts in our organization then use that data for their own purposes.
In the scope of our organization, this HotelPartners bounded context contains the canonical source of hotel room data.
Now let’s zoom out and look at the industry as a whole. We’ve sourced this data from hotel chains. So the hotel chains’ databases become the canonical source in the scope of the whole industry.
We can also zoom in and look at our organization’s specific bounded contexts. Much like our entire organization sources data externally and stores its own canonical representation, so can our bounded contexts. Specifically, they can source the data from the HotelPartners bounded context and store their own local copy.
Our product teams now have their own local canonical source of hotel room data. They are each free to enhance the data as they need, maybe from other bounded contexts. They can also store it in whatever format they see fit, and are not reliant on the HotelPartners team to make changes for them. Nor are they reliant on that team to maintain a particular SLA for its RPC services.
The product teams are also free to create other secondary data stores for the data. For example, the Search team might set up a secondary data store, say an Elasticsearch index, to support searching data across various axes. This secondary data store would still be sourced from Search’s canonical data store.
Meanwhile, the HotelPartners team is not burdened with creating and maintaining a “one-size-fits-all” data model, in which they try to make every product team happy in terms of the data fields that are stored.
Events Make It Happen
If you’re not familiar with event-based systems, you might be wondering how a product-oriented bounded context (e.g. Search) is supposed to derive its data from its enclosing scope’s canonical source (i.e. Hotel Partners). Wouldn’t it still need to make calls into Hotel Partners’ API services?
As it turns out, it doesn’t. Instead, the canonical source publishes its changes as events. Generally, we use an event log like Kafka for this purpose, but the details don’t matter here. What does matter is that the product-oriented bounded contexts can subscribe to those events, ingest them, and store the results in their own data stores.
Figures 1 through 3, then, are a bit too simplistic in depicting how the Booking and Search bounded contexts derive their data. Figure 4 provides a better look.
The Hotel Partners bounded context ingests data from the external partners and saves it into its Rooms database. Once it saves the data, it creates events that describe the saved data and publishes those events to an event log. The Booking and Search bounded contexts subscribe to that event log and, as the events come in, consume them and populate their own data stores.
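The flow just described can be sketched in a few lines. This is a toy, in-memory stand-in for an event log, not Kafka's API; the class names, the "RoomSaved" event type, and the record fields are all assumptions made for the example.

```python
from collections import defaultdict


class EventLog:
    """In-memory stand-in for an event log such as Kafka (illustrative only)."""
    def __init__(self):
        self.events = []                  # append-only list of events
        self.offsets = defaultdict(int)   # consumer name -> next index to read

    def publish(self, event):
        self.events.append(event)

    def poll(self, consumer):
        # Return events this consumer has not yet seen, then advance its offset.
        start = self.offsets[consumer]
        self.offsets[consumer] = len(self.events)
        return self.events[start:]


class HotelPartners:
    def __init__(self, log):
        self.rooms_db = {}
        self.log = log

    def ingest(self, room):
        # Save to the canonical Rooms database first, then describe the change.
        self.rooms_db[room["room_id"]] = room
        self.log.publish({"type": "RoomSaved", "room": dict(room)})


class Subscriber:
    """A downstream bounded context (e.g. Booking or Search) with its own store."""
    def __init__(self, name, log, fields):
        self.name, self.log, self.fields = name, log, fields
        self.store = {}

    def consume(self):
        for event in self.log.poll(self.name):
            room = event["room"]
            # Keep only the fields this context cares about.
            self.store[room["room_id"]] = {f: room[f] for f in self.fields}


log = EventLog()
partners = HotelPartners(log)
booking = Subscriber("booking", log, ["room_id", "price", "available"])
search = Subscriber("search", log, ["room_id", "city", "price"])

partners.ingest({"room_id": "r1", "city": "Lisbon", "price": 90, "available": True})
booking.consume()
search.consume()
```

Note that Booking and Search never call Hotel Partners. Each one reads the log at its own pace and shapes the data to suit its own needs.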
There are a few other items of note. First, the Search bounded context uses the same event-based mechanism to propagate changes to its canonical source (titled “Room Details” in the diagram) to its secondary search index. Also note that the Search team is subscribing to another event log (titled “Event Log: Reviews” in the diagram) to augment its inventory data store with customer reviews.
We commonly refer to such systems as event-based systems. They are distinct from the more traditional request-response systems that are powered by synchronous API calls, and that tend to drive the desire for single sources of truth. Event-based systems inherently imply eventual consistency. What this means is that at any given point, data across our organization may be in different states.
Absent any new changes, the state will converge. Since change is generally a constant, this means that for some periods (often measured in milliseconds) a given entity might be more up to date in one context (say, Hotel Partners) than it is in another context (e.g. Search).
The good thing is that in our well-designed systems, with proper bounded contexts, eventual consistency is perfectly acceptable. Back to our example, the Hotel Partners bounded context will first ingest data. The data then flows, at roughly but not exactly the same time, to both the Booking and Search bounded contexts.
At this moment, a given entity may be out of sync between each bounded context. However, since each bounded context represents a separate business domain, with its own applications and functionality, brief inconsistencies become mostly unnoticeable.
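A small sketch can illustrate what "the state will converge" means in practice. It assumes a hypothetical shared list standing in for the event log and two contexts that read it at different speeds; none of these names come from a real library.

```python
events = []  # hypothetical append-only log of (room_id, price) changes


class Context:
    """A bounded context that applies log entries in order, at its own pace."""
    def __init__(self):
        self.offset = 0
        self.prices = {}

    def catch_up(self, limit=None):
        # Apply up to `limit` unread events in order; None means "all of them".
        end = len(events) if limit is None else min(len(events), self.offset + limit)
        for room_id, price in events[self.offset:end]:
            self.prices[room_id] = price
        self.offset = end


booking, search = Context(), Context()
events.extend([("r1", 100), ("r1", 90), ("r1", 85)])

booking.catch_up()          # Booking has read everything
search.catch_up(limit=1)    # Search has only read the first change so far
lagging = (booking.prices["r1"], search.prices["r1"])

search.catch_up()           # no new changes arrive; Search drains the log
converged = booking.prices == search.prices
```

While Search is behind, the two contexts disagree about the price of the same room. Once Search has read the rest of the log and no new changes have arrived, both hold the same value.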
If You Must Obsess, Then Obsess About Your Data’s Originating Source
Often when looking for a “single source of truth” (that is, a single location from which to internally fetch an entity), what we're seeking is a single “originating source” (that is, a single location from which to externally ingest an entity).
In other words, any given entity should enter our organization via a single location. It follows naturally that the data will flow through our organization from that entry point. Moreover, it will flow in a single direction.
Let’s take Figure 3 from the previous section (with the understanding that we’re using an event log like Kafka to propagate data). There, we see that our organization ingests available hotel rooms from external sources via the Hotel Partners bounded context (perhaps via batch file ingestion). The data is then distributed to our org’s various bounded contexts (e.g. Search, Booking, etc).
Now let’s say that Search has added a new web application, allowing registered individuals to manually add room inventory into our system. Suddenly, we have two locations in our system in which we ingest the same data.
Why is this a problem? The complexity of managing the flow of data in our system has dramatically increased. If we have data coming in through both Hotel Partners and Search, both bounded contexts will need to publish their incoming data as messages. And of course, both will need to consume each other’s messages and make appropriate changes.
For example, Hotel Partners will need to consume messages from Search and update its database. Should it then publish that change as a message, which Search would subsequently consume? If we’re not careful, we’ll create an infinite loop of messages. What about Booking, which now needs to consume messages from both Hotel Partners and Search? Is it now responsible for sussing out from which service the data originated?
Next, consider conflict resolution. If someone uses Search’s web application to push data that conflicts with other data that we’ve ingested from industry sources, who decides how to resolve those conflicts?
Similar to conflict resolution, we have the issue of deduping. If we receive data from multiple sources, odds are that we’ll routinely ingest duplicate data. If this data enters our system in multiple places, where would this deduping take place? We’ll discuss deduping a bit more in the next section.
So we should be cognizant of where our data originates. And if at all possible, we should limit any given entity to a single originating source. If we truly need to allow users to add room availability data to our system, we should allow that only within the same scope in which we ingest bulk purchase data (that is, the Hotel Partners scope). That way, our data flows in a single direction, and propagation becomes much easier to reason about and much less error-prone.
As I discuss the concept of relative SoTs and context-based canonical sources of data, a few questions tend to arise. Let’s discuss them here.
What About Master Data Management?
We might question whether this idea of canonical data sources runs counter to the industry practice of Master Data Management, or MDM. Put simply, the use of MDM helps organizations ensure that they do not have duplicate representations of the same piece of information floating around various internal groups. Importantly here, MDM implies that an organization must have a single canonical source of every entity in the company’s business domain.
Despite first appearances, MDM doesn’t run counter to the idea of relative canonical sources. As discussed above, the organization will still have a canonical source of its data, relative to the organization as a whole. In our MDM system, we would dedupe records and assign them a unique ID. In turn, the records stored within the other bounded contexts would retain the entities’ canonical IDs.
Meanwhile, the MDM data store can and should remain relatively lightweight. It’s perfectly feasible for such data stores to house little beyond entity IDs, and perhaps some other basic identifying information.
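As a rough sketch of how lightweight such a store can be, the following maps basic identifying information to a canonical ID. The matching rule (hotel name plus room number, normalized) and all names are assumptions chosen for illustration; real-world record matching is usually much fuzzier.

```python
import uuid


class MasterDataStore:
    """Hypothetical lightweight MDM store: identifying fields -> canonical ID."""
    def __init__(self):
        self.id_by_key = {}

    @staticmethod
    def _key(record):
        # Assumed matching rule: same hotel name and room number, ignoring
        # case and surrounding whitespace.
        return (record["hotel"].strip().lower(), str(record["room_number"]).strip())

    def canonical_id(self, record):
        # Reuse the existing ID for a known entity; mint a new one otherwise.
        key = self._key(record)
        if key not in self.id_by_key:
            self.id_by_key[key] = str(uuid.uuid4())
        return self.id_by_key[key]


mdm = MasterDataStore()
a = mdm.canonical_id({"hotel": "Example Inn", "room_number": 12})
b = mdm.canonical_id({"hotel": "  example inn ", "room_number": "12"})
c = mdm.canonical_id({"hotel": "Example Inn", "room_number": 13})
```

Here the first two records resolve to the same canonical ID, while the third gets its own. Other bounded contexts would store that ID alongside whatever richer data they keep locally.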
What About Single Versions of Truth?
Sometimes we find ourselves with different applications that perform calculations on their own view of common data. In such cases, wouldn’t we be leaving open the possibility of differences between the calculation algorithms? If so, we might wind up displaying different results for the same data across different applications.
This is related to the concept of a single version of truth (SVoT). This concept states, roughly, that if multiple systems have their own view of the same data, there must be one agreed-upon interpretation of the data. Often SVoTs are referenced in the context of business analytics and decision-making, but they are also applicable when discussing distributed systems.
The truth is that while we don’t often need to worry about creating a single source of truth for our data, we sometimes need to define a single source of truth for our algorithms.
For example, our hotel-room-aggregation organization might provide ranked recommendations for would-be travelers. If the recommendations can appear in multiple places, then we’d want to show the same recommendations for a given user, no matter where they appear, or from where the raw data was sourced. From that standpoint, then, while the data can live in multiple locations, we need to ensure that a single algorithm is used to perform the calculations.
How do we ensure a single SoT for a given algorithm or calculation? We have a few options.
First, we can package and distribute it as a library, to be imported by all of the services that need it. However, reliance on libraries in a microservices architecture has some major drawbacks:
- Coordinating changes is difficult. If the algorithm were to change, we would need to build and deploy a new version of the library. Then we would need to redeploy all of those services in a coordinated fashion. This breaks a fundamental tenet of microservices: independently deployable services.
- We’ve tied ourselves to a single language. If our services are written in different languages, we’d need to repeat the algorithm in different languages. And then we no longer have a single SoT for that algorithm.
Alternatively, we can deploy a microservice to calculate the data on the fly. In this case, we’d require applications from various bounded contexts to synchronously call into this service to get the calculated data. While this approach may be preferable to a library, it gets us back to the original issue that we’ve looked at in this article: we now have enforced a single source-of-truth for our calculations.
So then, why can’t our new microservice simply perform calculations as hotel room availability changes, and publish those calculations for different bounded contexts to consume? That is, after all, how event-based systems work.
With this approach, our new microservice becomes a consumer of the data produced by Hotel Partners, just as our other bounded contexts’ microservices are.
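Here is a minimal sketch of that idea, assuming a hypothetical RecommendationService that recalculates a ranking whenever a room event arrives and publishes the result. The scoring formula is a placeholder invented for the example, as are the event and field names.

```python
class RecommendationService:
    """Hypothetical owner of the one ranking algorithm."""
    def __init__(self, publish):
        self.rooms = {}
        self.publish = publish   # callback that appends to an event log

    @staticmethod
    def score(room):
        # Placeholder algorithm: a higher rating and a lower price rank first.
        return room["rating"] * 10 - room["price"] / 10

    def on_room_saved(self, room):
        # Recalculate when the underlying data changes, then publish the result.
        self.rooms[room["room_id"]] = room
        ranked = sorted(self.rooms.values(), key=self.score, reverse=True)
        self.publish({"type": "RankingUpdated",
                      "room_ids": [r["room_id"] for r in ranked]})


rankings_log = []
service = RecommendationService(rankings_log.append)
service.on_room_saved({"room_id": "r1", "rating": 4.0, "price": 200})
service.on_room_saved({"room_id": "r2", "rating": 4.5, "price": 150})

# Each consuming context just stores the latest published ranking.
search_ranking = rankings_log[-1]["room_ids"]
booking_ranking = rankings_log[-1]["room_ids"]
```

The algorithm lives in exactly one place, yet no consumer has to call the service synchronously. Search and Booking each keep their own copy of the published result.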
Are We Saying That a Single Source of Truth Is Always a Bad Idea?
We’ve seen that the enforcement of a single source of truth is often unnecessary, detrimental, and in many cases, a fallacy that we’re trying to make a reality. Still, we may encounter some data for which we do want to enforce a single SoT. Eventual consistency may not be acceptable for certain data. We may also have services that are infrequently accessed, are not in any critical paths, and have simple, infrequently changing APIs.
Authentication is a common case. Typically, we want a single location to manage a user’s login credentials, roles, permissions, and tokens. Generally, these activities happen (relatively) infrequently, but we want to be sure any updates (changes in passwords/permissions, logged-in status, account lockouts, etc) are reflected immediately across all applications.
We might opt to define a single source of truth for our authentication data. Even in this case, however, we would want to be sure that we keep the authentication model as light as possible. User details such as name and contact information, for example, would go elsewhere.
This is not a golden rule to which we must adhere. Despite common thinking, single sources of truth for our data are not an automatic requirement. Moreover, they are often a hindrance in terms of both productivity and application performance and scalability.
Designing our systems instead with canonical sources and scopes in mind will help us avoid bottlenecks, allow us to design more flexible, scalable event-based systems, and let our teams focus on building their own stuff.