3 Lessons From Transitioning a Monolith to Cloud Native
Read on for the story of GitLab's transition from a monolithic architecture to cloud native services, and what you can learn for your own transition.
Like many software products, GitLab started out with a classic monolithic architecture. We’re currently working on moving towards cloud native, but as with anything you attempt to convert retroactively, there are challenges, both foreseen and unforeseen, that we’re bumping into along the way. Currently, the hope is to be able to use the cloud native chart for our primary production environment in roughly Q3 of 2018, if not earlier. We plan for our new cloud native Helm charts to be Alpha (not production ready) by the end of March. Things are moving really fast, but the team always remembers that our move towards cloud native cannot compromise our customers’ experience. Here are some of the lessons we’ve learned along the way.
1. Accept Where True Microservice Migration Just Isn’t Possible
One of GitLab’s biggest challenges has been that we, along with all our customers, use our Omnibus package to provide all component services. It manages everything about the services: how to turn them on, how to configure them, and every stage in between. However, with cloud native, we can’t use the Omnibus package because it’s a monolith. While the package is technically composed of a bunch of intertwined services, it’s still one very large installation. To replace it, we have to automate and tool the installation from source instead. This means that while we can reuse some of the logic, we can’t use the tool itself. So, we’re reinventing the wheel on how we deploy GitLab.
Re-architecting things is never simple, but we also have to think about it from the customers’ perspective. How easy is it to configure? How can we make it easy while providing scalability between 10 and 1,000,000 users? We can’t say, “This is our new suggested method. By the way, it’s harder.”
The plan at GitLab is to phase the Omnibus GitLab package out of our production environment for all stateless services. All stateful services would be provided by what are referred to as “pets.” For GitLab, these are our Postgres database, our Redis services for persistent cache and job queues, and likely our Gitaly services (our RPC-based Git services). We need to be able to provide these critical components in a highly available (HA), highly resilient, but also highly configurable manner. For these, we will continue to rely on the curation provided by Omnibus GitLab, and use Helm to provide the configuration of all microservices that do not require local state.
Postgres in HA environments is a very complex topic, and attempting to bring those services into a platform such as Kubernetes requires that the platform be aware of the complex state machines these services use. Our partners and the greater community recognize that such capabilities are currently beyond Kubernetes. This may change in the future, but GitLab can’t risk our own or our customers’ experience on immature or incomplete solutions, so we’ve explicitly chosen to exclude Postgres from production deployments via Helm charts.
2. Know the Limitations of the Tools You Choose
We’re making use of Kubernetes and Helm, “best of breed” tools for what we’re attempting to do. These have a lot of velocity behind them, which is great because they are continuously evolving, but it also means that, while technically production ready, they are rapidly changing. There are certain behaviors we weren’t fully aware of regarding how Kubernetes and Helm handle interactions, and it would obviously have been better to know the existing limitations upfront, as well as the roadmap for fixing them. Do we wish that these tools had been more mature at the time? Of course! But the reality is there’s no other tool that could have provided us with this option.
The only direct alternative might have been Docker Swarm, but that would have involved much more hand tooling. Helm is built on top of Kubernetes, and provides us with templating and information about how the deployment works. With Docker Swarm, by contrast, you have to provide all the individual configurations yourself, writing the tooling and template handling needed to create them. Doing so would have created a higher maintenance cost, and would actually have made GitLab harder for a customer to deploy than with Omnibus GitLab. Harder is the opposite of what we want at GitLab.
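
To make the templating point concrete, here is a minimal sketch of the idea of rendering several service configurations from one set of values. It is not Helm's template engine (Helm charts use Go templates over YAML); it uses Python's standard `string.Template` purely for illustration, and the service names, keys, and registry address are hypothetical.

```python
from string import Template

# Hypothetical per-service template; a real Helm chart would use Go templates.
SERVICE_TEMPLATE = Template(
    "name: $name\n"
    "replicas: $replicas\n"
    "image: $registry/$name:$tag\n"
)


def render_all(values: dict) -> dict:
    """Render one config document per service from shared and per-service values."""
    shared = values["global"]
    rendered = {}
    for name, overrides in values["services"].items():
        # Per-service settings override the shared defaults.
        settings = {**shared, **overrides, "name": name}
        rendered[name] = SERVICE_TEMPLATE.substitute(settings)
    return rendered


# Hypothetical values: one place to set defaults, small overrides per service.
values = {
    "global": {"registry": "registry.example.com", "tag": "1.0.0", "replicas": 1},
    "services": {
        "web": {"replicas": 3},
        "worker": {},
    },
}

configs = render_all(values)
print(configs["web"])
```

Without a templating layer, each of those per-service documents would have to be written and kept in sync by hand, which is the maintenance cost described above.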
3. File System Dependencies Can Bite You
When applications haven’t been designed for cloud native from the ground up, it’s not uncommon for them to expect a shared file system. In the past, horizontal scaling was handled by parallel NFS mounts across multiple nodes. While that approach functioned, it has risks. For example, Workhorse (a thin smart proxy in front of Rails) handles a user upload, places it into the file system, and then tells Unicorn where the file is. If these two services don’t share a file system, Workhorse can’t tell Unicorn where to find it. Worse, what if that NFS service goes offline?
You end up with a suite of services that work well in tight combination, but only because they assume they’re on the same file system, if not the same VM, rather than communicating strictly over the network with no shared state. A number of the components that make up our suite currently assume some amount of shared state that isn’t provided by a separate, external network service. This has been a real challenge for us, and we can’t be the only ones who have services assumed to be on the same box.
When we’re doing this cloud native work, the one thing we have to assume is that there is no such thing as shared storage. Therefore, NFS is not an option for connecting the various components. When treated as microservices, everything has to be done over the network.
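
The hand-off problem can be reproduced in a few lines. The sketch below is not GitLab code: the `Service` class is a hypothetical stand-in for any component that can only see files under its own root, and two temporary directories play the role of two separate file systems.

```python
import tempfile
from pathlib import Path


class Service:
    """Hypothetical service that can only see files under its own root directory."""

    def __init__(self, root: Path):
        self.root = root

    def store_upload(self, name: str, data: bytes) -> str:
        # Write the upload locally and hand back a path relative to our root.
        (self.root / name).write_bytes(data)
        return name

    def can_read(self, relative_path: str) -> bool:
        return (self.root / relative_path).exists()


with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as separate:
    proxy = Service(Path(shared))
    app_same_box = Service(Path(shared))      # shares the proxy's file system
    app_elsewhere = Service(Path(separate))   # has its own file system

    handed_off = proxy.store_upload("upload.bin", b"payload")
    same_box_ok = app_same_box.can_read(handed_off)
    elsewhere_ok = app_elsewhere.can_read(handed_off)

print(same_box_ok, elsewhere_ok)  # prints: True False
```

The path is only meaningful to a service that happens to see the same disk, which is exactly the assumption that breaks once each component runs in its own container.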
What We’re Doing to Address It
We’ve done a lot of work recently to mitigate the concerns of how GitLab handles shared storage. We have progressively introduced native support of Object Storage for LFS objects, CI logs and artifacts, attachments and more. We’re doing this in such a way as to remain agnostic to which provider a user may subscribe, such as Amazon S3, Google Cloud Storage, or even a local Minio instance.
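
As a rough illustration of what staying provider-agnostic can look like, the sketch below has the uploading side store bytes behind a small interface and pass along an object key instead of a local path. This is an assumption-laden example, not GitLab's implementation: `ObjectStore`, `InMemoryStore`, `handle_upload`, and `process_upload` are hypothetical names, and the in-memory backend stands in for a real client for S3, Google Cloud Storage, or Minio.

```python
import hashlib
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface any storage backend would satisfy."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend for illustration; a real one would call a storage service."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def handle_upload(store: ObjectStore, data: bytes) -> str:
    """Proxy side: store the bytes and return a key, not a file system path."""
    key = "uploads/" + hashlib.sha256(data).hexdigest()
    store.put(key, data)
    return key


def process_upload(store: ObjectStore, key: str) -> int:
    """Application side: fetch by key over the store's API and return the size."""
    return len(store.get(key))


store = InMemoryStore()
key = handle_upload(store, b"hello")
size = process_upload(store, key)
```

Because both sides depend only on the interface, swapping the backend does not change the calling code, and neither side needs to share a disk with the other.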
In terms of Git content, we’ve long since begun work on Gitaly, a Git RPC service that we are currently rolling out across GitLab.com, to replace our legacy NFS-based file-sharing solution. The intent is that not every node needs a mount from a shared NFS server. Instead, each node makes network calls to the Gitaly service, which then handles the actual Git operations on its own local file system. This reduces the risk of single points of failure by distributing the load, and it actually improves I/O performance because of file system caching on the endpoint: by nature, Git handles many small files in rapid succession, which is exactly what that buffering is designed for. A very large, heavily used NFS server has to track all files across all mounts, whereas a Gitaly node only has to worry about the data it hosts. By distributing the total load across many nodes, we get higher performance from each storage node and greater performance from the suite as a whole.
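
The shape of that design can be sketched as a router that sends each repository's calls to the one node that owns it. This is a simplified illustration rather than Gitaly's actual API: `StorageNode`, `Router`, the repository paths, and the storage names are all hypothetical, and the method calls stand in for what would be network RPCs.

```python
class StorageNode:
    """Hypothetical stand-in for a Git RPC node that keeps its repositories locally."""

    def __init__(self, name: str):
        self.name = name
        self.repos = {}  # repository path -> list of commit messages

    def create(self, repo: str) -> None:
        self.repos[repo] = []

    def commit(self, repo: str, message: str) -> None:
        self.repos[repo].append(message)

    def log(self, repo: str) -> list:
        return list(self.repos[repo])


class Router:
    """Maps each repository to exactly one node; callers never touch a shared mount."""

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}
        self.assignment = {}  # repository path -> node name

    def create(self, repo: str, node_name: str) -> None:
        self.assignment[repo] = node_name
        self.nodes[node_name].create(repo)

    def node_for(self, repo: str) -> StorageNode:
        return self.nodes[self.assignment[repo]]


node_a, node_b = StorageNode("storage-a"), StorageNode("storage-b")
router = Router([node_a, node_b])
router.create("group/app", "storage-a")
router.create("group/docs", "storage-b")

# In a real deployment this would be a network call to the owning node.
router.node_for("group/app").commit("group/app", "Initial commit")
history = router.node_for("group/app").log("group/app")
```

Each node tracks only the repositories assigned to it, so its file system cache is spent on its own data rather than on every file across every mount.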
Unsurprisingly, changes as large as these, along with replicating the ease of use of Omnibus GitLab in a cloud native architecture, take time and concerted effort. We’ve had to be somewhat reactive in our approach, adapting to curveballs that have come up along the way, but we’re hoping that the migration will have a positive impact long into the future. If your organization is attempting something similar, I hope what we’ve learned will be useful to you. As with everything at GitLab, we’ll keep iterating on it transparently and welcome your feedback.