Open Source Solution for Building Invincible Apps
Building microservices and distributed systems is complex, so developers frequently cobble things together with scaffolding. Now they don’t have to.
I left Microsoft after 19 years there, during which I led teams that built system software for highly scalable cloud applications. This included leading development of the Microsoft Orleans framework from its inception at Microsoft Research until it became one of the most successful open-source projects within the .NET ecosystem. Orleans powers a number of large-scale Microsoft systems such as Xbox Game Services, Skype, Azure IoT, Azure ML, and Azure Active Directory, as well as many cloud services outside Microsoft. So if you’ve ever played online multiplayer games like Halo or Call of Duty, our team built much of the underlying infrastructure that supports them.
When I originally joined Orleans, cloud computing was still in its infancy. We had a 10,000-foot vision and not a single line of actual code. We needed to reimagine how cloud-scale applications should be coded because, at the time, highly available, high-performance, scalable systems were achievable only by experts. And while everyone knew the cloud was coming, we had no idea how to build applications in a way that ensured they would be accessible and productive for millions of software engineers.
The overarching question — which many developers still struggle with today — was how to reduce the amount of complexity inherent in application development. I ended up spending half my career at Microsoft working to solve this problem. The answer was designing a programming model that would change how developers conceptualize applications, how they structure code, and ultimately what kind of mental models they build looking at the problems they face. That challenge captured my interest throughout the rest of my time at Microsoft, and continues to this day in my new role at Temporal.
In the last decade, tooling and infrastructure have evolved rapidly: Kubernetes, AWS, and Azure, just to name a few. There are hundreds of services that can help developers build applications. This rich set of off-the-shelf services made it easier to build sophisticated application logic by leveraging somebody else’s code instead of writing your own. But at the same time, it inadvertently made cloud applications even more distributed than before.
It is relatively easy to implement and test a happy path in application code. The more difficult part is to make sure the code correctly handles all possible failure conditions. The degree of difficulty grows exponentially with the number of remote dependencies the application interacts with. It’s the same challenge that’s always been there, but now at a different level of complexity.
When I was working on Orleans, we focused on creating a programming model in which it would seem like you were programming a single computer, yet the application could expand if needed to dozens or hundreds of servers. The programming model of Orleans has proven successful in reducing the complexity of building scalable cloud services. Its core premise of always-available, uniquely addressable objects (grains) resonated well with a broad range of developers. The “always available” part helped make the application code more reliable. If one of the servers where the application runs fails, the application quickly recovers and keeps running.
The other part of the reliability challenge, what to do when a call to a dependency service returns an error, we left to application developers to solve. Part of the reason we didn’t try to offer a solution there is that it is very difficult to find one that fits the broad range of real-life scenarios. And of course, it’s very hard to get it right.
I wrote a blog post a few months ago about the typical patterns of dealing with failures of remote calls here: https://dev.to/temporalio/dealing-with-failure-5adf. It shows the complexity and inherent tradeoffs of this problem, and that’s what developers have to deal with on top of solving the actual business problems. They are forced to make decisions about what to do when they get an error. Do I retry? Right away or after a delay? How many times? What if I crash in between these calls? And so on.
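To make those questions concrete, here is a minimal sketch of one common answer: retry a failed call a bounded number of times, waiting longer after each failure. The function name, parameters, and defaults are all hypothetical, chosen for illustration; this is not code from the linked post or from any particular library.

```python
import time


def call_with_retries(operation, max_attempts=5, initial_delay=0.5,
                      backoff=2.0, sleep=time.sleep):
    """Call operation(), retrying on any exception with exponential backoff.

    All names and defaults here are illustrative assumptions.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: surface the last error to the caller
            sleep(delay)      # wait before the next attempt
            delay *= backoff  # back off: 0.5s, 1s, 2s, ...
```

Even this small helper bakes in several of the decisions listed above (how many times, how long to wait), and it still says nothing about the last question: if the process crashes between attempts, the retry state is lost with it.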
Scoping the Problem
I found myself face-to-face with these challenges when I led a team that ran a critical service for Xbox games, one that handled a massive number of multiplayer game servers. We were allocating VMs for multiplayer games like Forza, Halo, and Minecraft, so players could join from their consoles and play together with their friends or compete with millions of other players. Players take it for granted that everything works immediately and smoothly. But on the service side, it was daunting.
Each game requires a VM that needs to be allocated and configured. Then it needs to be provisioned with the game server code, which means gigabytes to download. Once all the required bits are on the VM, the game server needs to start and eventually report that it’s ready for a multiplayer match. As you can imagine, each of these steps takes multiple remote calls that may fail. And at this scale, they do fail on a regular basis.
Requested operations take time, from seconds to minutes, and the service code has to anticipate failures and operations getting stuck for unknown reasons. It also needs a policy for retrying and giving up, and a recovery plan in case the service code itself gets restarted in the middle. This is all before a multiplayer match is even started on the game server. After that, the match has its own lifecycle that needs to be handled with similar expectations about failures.
This is a simplified version of the model, but it provides a small glimpse into the lower-level mechanics of developing cloud services, in this case for games. There are similar challenges in virtually every industry today, be it financial transactions, e-commerce, IoT, infrastructure management, or something else.
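The “recovery plan” for a mid-flight restart can be sketched as follows: record each completed step, and on restart skip whatever is already done. This is only an illustration under stated assumptions, not the actual Xbox service code. The step names, the `actions` mapping, and the JSON state file are all hypothetical.

```python
import json
import os

# Hypothetical step names mirroring the provisioning sequence described above.
STEPS = ["allocate_vm", "configure_vm", "download_game_server", "start_game_server"]


def run_provisioning(state_path, actions):
    """Run each step in order, recording progress so a restart can resume.

    `actions` maps a step name to a callable; `state_path` is a JSON file
    holding the list of completed steps. Both are assumptions for this sketch.
    """
    done = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)
    for step in STEPS:
        if step in done:
            continue  # finished before a previous crash or restart
        actions[step]()  # may raise; progress so far stays on disk
        done.append(step)
        with open(state_path, "w") as f:
            json.dump(done, f)
    return done
```

Calling `run_provisioning` again with the same `state_path` after a failure resumes at the first unfinished step. Note the gap that remains: if the process dies after a step succeeds but before the state file is written, that step runs again, so every step would have to be safe to repeat. Timeouts for stuck operations and per-step retry policies would add yet more code on top.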
It All Comes Back to Workflow
If we generalize the problem that application developers solve in all these cases, we can view them as workflows of various kinds. Each workflow is a business process with multiple steps or actions that have to be executed to completion despite occasional failures of some of them. While the example above is specific to gaming, the underlying problem is not. With the rise in distributed systems, programmers frequently find themselves stitching things together, addressing the same problems time after time after time. How do we deploy clusters? How do we monitor them? How do we deploy distributed databases? What happens when we get an error?
Every developer is familiar with a solution that looks something like this:
But if we remove all this ugly complexity that adds no real business value, we can focus on the application logic instead. If we replace the duct tape boilerplate code with a transparent built-in mechanism, this diagram becomes much simpler and much more elegant. And this is what we’re doing at Temporal.
For me, it’s a broad-scope effort, helping developers build bulletproof applications without all the unnecessary scaffolding. In a lot of ways, it’s a continuation of what made me passionate about working on Orleans. The goal is largely the same — to move away from the world where millions of developers are forced to solve the same reliability problems over and over again, to a better world of elegant code solving the business problem at hand. Today we see a lot of boilerplate that takes the focus away from what’s important for the business and pollutes the codebase with thousands and thousands of lines of difficult-to-maintain code.
Microservice architectures, with all their benefits, made the problem even more pronounced. Today’s approaches for working with microservices simply do not meet the scalability and reliability requirements of modern applications. To help, we’ve introduced an open-source solution for microservice orchestration. It is released under the MIT license, no shenanigans.
If you decide to try it out, I’d love to hear your candid feedback, ideas, and thoughts. Or if you’re interested in joining our team, we’re always looking for developers who thrive on solving these types of complex challenges.