I want to quickly paint the picture in my head about distributed systems (maybe it’s a sloppy picture, but nevertheless). When we talk about microservices, we talk about using them as a vehicle for building business-agile IT systems: systems that allow a business to change more quickly, build new functionality, experiment, and stay ahead of its disruptors and competition (startups, etc.).
As part of autonomous systems that interact with each other to provide business agility, we also need to consider what happens when parts of these systems fail and how a system reacts to overcome failure. A central prerequisite to being able to build Agile, failure-tolerant systems is autonomy. Autonomous systems can evolve independently from each other because they tend to shed dependencies on other systems, teams, and processes. Changes to service A shouldn’t force service B to change, nor should there be any other ripple effects. If service A (on which service B depends) goes down, service B should not just blow up.
Where do we have examples of this autonomy in other systems outside of microservices? Well, if you follow the real reasons why microservices are a success, then you know it’s not the technology per se that enables the Netflixes and Amazons of the world to be successful with microservices: it’s the organizational structure.
Some examples of these same types of Agile systems include open-source communities, cities, stock markets, ant colonies, flocks of birds, and countless others. They can evolve, react, and even continue on in the face of massive failure. In fact, they’re a well-studied bunch in the field of Complex Adaptive Systems theory. The underlying common themes between these systems? Purpose, autonomy, and reaction to their environments. These autonomous agents react to events.
When something happens, an autonomous agent (ant, person, service) can do something or do nothing, but it’s these events that drive the behavior in complex adaptive systems. Think about how you (as an autonomous person) do things throughout the day. You wake up, you dress based on the temperature (an event or fact), and you get in your car and drive to work (stopping at stop lights, avoiding the people driving erratically, and partaking in other events). These are all responses to events. You get emails in your inbox, you respond. You get a text from your wife to pick up dinner on the way home, etc. We live our entire life responding to events. IT systems built on events can be made to be equally autonomous, scalable, and resilient to failures.
Going From Authority to Autonomy
In most distributed systems implementations I’ve seen, we tend to extend the notion of building systems within a single address space to building across an unreliable network. This is a bad idea for many reasons, but many times, it appears to be the simpler approach. We tend to invoke remote objects to prod them to do something, or we call a remote service to look up data. Maybe the tax service is the canonical location for anything to do with tax calculations. If we’re a shopping cart service, we need to calculate the final price for the items in a shopping cart during checkout. So, the shopping cart service calls the pricing service. The pricing service may also call the tax service to do some other adjustments to the price based on shipping location (country, state, city, etc.). The tax service may call the catalog service (taxes may be different depending on product). The shipping service may also call the inventory service, etc.
We may end up with these long strings of calls (which may be okay in a monolithic application where all these objects live in the same address space, etc.). We’re following the authority pattern of accessing data; we call the service that has authority over the data. To me, this feels a bit like shared global state and tons of mutexes and synchronization points. It also has nasty implications in terms of transactionality or ACIDity of a series of calls to authorities.
This can lead to bottlenecks. It can also lead to hung services and cascading failures if some of these services in the chain are unavailable. It can also lead to weird dependencies where something like the inventory service now has to expose data in a certain way for the tax service and something different for the shipping service to consume. Or, it exposes the data in one single format with lots of additional details that neither service really cares about.
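To make the cascading-failure point concrete, here is a rough Python sketch of the authority pattern. Every name in it (`tax_service`, `pricing_service`, `shopping_cart_checkout`, `ServiceUnavailable`, the tax rates, and the `tax_available` flag that simulates an outage) is made up for illustration; real services would be talking over a network, not calling local functions.

```python
class ServiceUnavailable(Exception):
    """Raised when a (hypothetical) remote service cannot be reached."""


def tax_service(country, available=True):
    # Hypothetical authority for tax rates; the rates are invented.
    if not available:
        raise ServiceUnavailable("tax service is down")
    return {"US": 0.07, "CU": 0.10}.get(country, 0.0)


def pricing_service(subtotal, country, tax_available=True):
    # Pricing asks the tax authority synchronously, at request time.
    rate = tax_service(country, available=tax_available)
    return round(subtotal * (1 + rate), 2)


def shopping_cart_checkout(subtotal, country, tax_available=True):
    # The cart asks pricing, which asks tax: a chain of just-in-time calls.
    return pricing_service(subtotal, country, tax_available=tax_available)
```

With everything up, `shopping_cart_checkout(100.0, "US")` returns a price. With `tax_available=False`, the exception raised two hops away surfaces in the cart, even though the cart never talks to the tax service directly.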
What if we looked at this model differently? What if we inverted the model? Instead of relying on and invoking services for their authority on certain matters, we rely on time and events (like we do in the real world!) to understand context about our environment before our service even gets invoked? What if we were able to listen to our environment and find that shipping from the USA to Cuba has just introduced a lower tax than it once was? This is a fact that we can observe and react to, or we can just ignore it and do nothing. What if we could know that the tax on shipping to Cuba is now lower and capture that data so we could know it for future queries about shipping to Cuba when we display the shopping cart page? Then, we may have a little more autonomy over our data and our service.
We could store that information, or derivatives of that information, in our own databases, which would be optimized for the types of service we provide. If we have to make a version change to our service, we can just focus on what it means to version our own schemas and data and not have to worry about what happens when other dependent services change.
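A minimal sketch of that inversion, in Python, might look like the following. The `TaxRateChanged` event, the `ShoppingCartService` class, and the in-memory dictionary standing in for "our own database" are all hypothetical, and falling back to a rate of zero when no event has been seen yet is a simplification I chose for brevity, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxRateChanged:
    """Hypothetical event: a fact about the environment that already happened."""
    origin: str
    destination: str
    rate: float


class ShoppingCartService:
    def __init__(self):
        # Local, service-owned view of tax rates (stand-in for our own database).
        self._tax_rates = {}

    def on_event(self, event):
        # React to the facts we care about; silently ignore everything else.
        if isinstance(event, TaxRateChanged):
            self._tax_rates[(event.origin, event.destination)] = event.rate

    def total(self, subtotal, origin, destination):
        # Answer from local data; no call to a tax authority at request time.
        rate = self._tax_rates.get((origin, destination), 0.0)  # assumed default
        return round(subtotal * (1 + rate), 2)
```

For example, after `cart.on_event(TaxRateChanged("US", "CU", 0.05))`, a later `cart.total(200.0, "US", "CU")` is computed entirely from the cart's own copy of the data, whether or not the tax service is reachable at that moment.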
Embracing Eventual Consistency
Responding to events instead of “just-in-time” querying for authority allows our service to be more autonomous, fault-tolerant, and resilient.
However, delays affect autonomous event-driven systems just as they affect autonomous complex adaptive systems in the real world.
If you are notified of an event immediately, you can react immediately. For example, if a car is swerving into your lane and you see this, you can quickly hit the brakes or adjust your driving to not collide. However, if there is some kind of delay in observing this event, then your reaction may be slow (maybe you’re driving impaired or playing on your cell phone or yelling at your kids for doing something, etc. …okay, please don’t send me mail about how to be a parent!).
This can also happen in IT systems. Let’s say that I order something on Amazon. This publishes an event, or fact, to other autonomous services (like order processing, billing, inventory, etc.). These systems can observe this event, but what if the inventory system is disconnected from the network for a few minutes, hours, whatever? When it comes back, it will eventually see the event, proceed to check inventory, and publish any events it deems necessary (that is, react), such as an “InventoryReserved” or “InadequateInventory” event. This is a simple example of a set of autonomous systems eventually becoming consistent.
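Here is one way the inventory example could be sketched in Python. I am assuming an append-only log that the inventory service reads from at its own pace, remembering how far it got; `EventLog`, `InventoryService`, `catch_up`, and the dictionary-shaped events are all hypothetical stand-ins, not any particular product's API.

```python
class EventLog:
    """Hypothetical append-only log that keeps events in arrival order."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        # Return a copy of everything at or after the given position.
        return self._events[offset:]


class InventoryService:
    def __init__(self, log, stock):
        self._log = log
        self._stock = dict(stock)
        self._offset = 0  # position of the next event this service has not seen

    def catch_up(self):
        """Process every event missed so far and publish the reactions."""
        reactions = []
        for event in self._log.read_from(self._offset):
            self._offset += 1
            if event["type"] != "OrderPlaced":
                continue  # observe it and do nothing
            sku, quantity = event["sku"], event["quantity"]
            if self._stock.get(sku, 0) >= quantity:
                self._stock[sku] -= quantity
                kind = "InventoryReserved"
            else:
                kind = "InadequateInventory"
            reactions.append({"type": kind, "order_id": event["order_id"]})
        for reaction in reactions:
            self._log.append(reaction)  # publish our own facts for others
        return reactions
```

Orders can pile up in the log while the inventory service is offline. Whenever it reconnects and calls `catch_up()`, it works through the backlog in order and publishes its reactions, which is the "eventually" in eventually consistent.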
What Technologies Are at Play Here?
I have one last thing to say about events, delays, and autonomy here. Events are only useful if we can capture them and observe them in the order they occurred. That is, total ordering over a set of events must be preserved for our systems to have any confidence in how to react to them. If you start to squint, then you can see how “ordering” also plays a role in how we construct “transactionality” across systems (more on that later). If we start seeing events out of order, then we can never claim to get to eventual consistency without some kind of manual intervention. Martin Kleppmann calls this perpetual inconsistency.
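To see why order matters, consider a consumer that only applies events in the order they were produced. The Python sketch below assumes the producer stamps each event with a gap-free sequence number starting at 1; that assumption, along with the `OrderedConsumer` class and its `receive` method, is mine and is only there to illustrate the idea.

```python
class OrderedConsumer:
    """Applies events strictly in sequence order, buffering early arrivals."""

    def __init__(self):
        self.next_seq = 1   # assumption: sequence numbers start at 1 with no gaps
        self.pending = {}   # events that arrived before their turn
        self.rates = {}     # current state: route -> latest tax rate

    def receive(self, seq, route, rate):
        if seq < self.next_seq:
            return  # already applied; ignore the duplicate
        self.pending[seq] = (route, rate)
        # Apply as many consecutive events as we now hold.
        while self.next_seq in self.pending:
            applied_route, applied_rate = self.pending.pop(self.next_seq)
            self.rates[applied_route] = applied_rate
            self.next_seq += 1
```

Suppose the rate for a route is set to 0.10 (event 1) and then lowered to 0.05 (event 2), but event 2 arrives first. This consumer holds event 2 until event 1 shows up and ends with 0.05. A consumer that simply applied events on arrival would end with the stale 0.10 and stay wrong until someone noticed.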