There is no such thing as a free lunch; almost every technology or architecture choice we make comes with pros and cons. Martin Fowler and his colleagues at ThoughtWorks have written an excellent article about the tradeoffs involved with using a microservices architecture. This article goes into some detail about one particular issue that is not covered there, using Spotify’s software cataloging system as an example.
One of the benefits of a microservices architecture is that it helps you scale your development organization, through a combination of strong module boundaries and independent deployment. So you can have more teams build services more quickly. This, like so many good things in computer science, is a double-edged sword, since the ability to build many services quickly tends to lead to...many services. When you have a large ecosystem of many small services, even if each service is simple, the sheer volume means that understanding the ecosystem becomes hard.
At Spotify, we currently have around 100 teams that independently build, deploy, and run microservices in our backend. We’ve got nearly 1,600 services in our catalog and about 1,000 names registered in our service discovery systems (see below for more on the difference). This means we’re long past the point where word-of-mouth is enough to find out who owns a service and what it does. We’re on the second generation of tracking tools, using an in-house tool called System-Z to catalog our software.
System-Z: A Software Cataloging and Tooling System
System-Z consists of a set of microservices (obviously!) and a web UI.
At the heart of System-Z is a service called "sysmodel," which tracks statically configured metadata about the various microservices in our ecosystem. This information has the same kind of problem as any documentation: the person or team who is capable of providing it is not the person or team who most benefits from it being in great shape. Since it is beneficial for Spotify as a whole to have great quality metadata about our services, we try to encourage teams to keep their service metadata up to date, through, for instance:
Storing the metadata together with the code, clearly highlighting that the metadata has the same owner as the code.
Making it easier to use tools for managing your services if the data is good.
Showing warnings and hints that make your services look "untidy" if the data isn’t up to date, hopefully appealing to engineers’ sense of cleanliness to fix the warnings.
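As an illustration of the first point, metadata stored alongside the code might look something like this. This is a hypothetical sketch: the filename, field names, and values are invented for illustration, not sysmodel’s actual schema.

```yaml
# service-info.yaml — hypothetical metadata file checked in next to the code,
# so it clearly shares an owner with the code itself
id: playlist-metadata
component_type: service
owner: playlist-infra-squad        # the team responsible for this component
description: Serves playlist metadata to backend clients
discovery_names:
  - playlist-metadata
dependencies:                      # other services this one calls
  - login
  - user-profile
```

Keeping the file in the same repository means a code review that changes the service naturally prompts a review of its metadata, too.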
In order to get some data that is dynamic by nature (where is this thing running, which version is it, etc.), and to improve the reliability of some data (which other services is it actually calling, etc.), we also collect runtime data by polling the running instances. The metadata is created and published by our backend framework Apollo.
System-Z has also become the place where most of the tools for managing backend services are made available, and most of those tools tend to be built on top of the data that the sysmodel service provides.
Some of the core concepts in the sysmodel data model are:
- Software component. Although System-Z and sysmodel were developed to support our microservices, we track not only microservices here but also data processing pipelines, libraries, third-party tools like Jenkins, etc.
- Role. A function that we want to scale to a certain number of users, deploy to a certain number of availability zones, or run in specific geographic locations. An example could be ‘login’, which is what allows Spotify users to log in. A Role is usually instantiated on a (virtual) host and requires one or more components to be running on the host in order to work. Roles are typically scaled horizontally to a sufficient number of hosts. A Kubernetes Pod is a good example of an implementation of this concept.
- Project. A set of roles that are related; could, for instance, be the ‘login’ role together with the data store that contains user data.
- Recipe. A description of the software components that need to be installed together on a host that plays a role. A Kubernetes pod template is a good example of an implementation.
- Discovery name. To express dependencies between microservices, we use what we call a discovery name. This indirection allows us to do things like inserting a proxy in front of an existing service, for caching, upgrading, deprecation or dark launching purposes, helping us run and improve our product without downtime. A deployed component can register zero or more discovery names.
- Owner. One of the most frequent questions about software in our catalog is "Who owns it?", as knowing that is necessary to understand who to ask, "How do I use it to do X?"
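The relationships between these concepts can be sketched in a few dataclasses. This is illustrative only; the class names, fields, and example data are assumptions, not sysmodel’s actual model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    """A tracked piece of software: microservice, pipeline, library, ..."""
    name: str
    owner: str                                   # owning team (invented names below)
    discovery_names: List[str] = field(default_factory=list)

@dataclass
class Recipe:
    """Components that must be installed together on a host playing a role."""
    components: List[Component]

@dataclass
class Role:
    """A horizontally scaled function, e.g. 'login'."""
    name: str
    recipe: Recipe

@dataclass
class Project:
    """A set of related roles, e.g. 'login' plus its user-data store."""
    name: str
    roles: List[Role]

# Answering the most frequent catalog question, "who owns it?", becomes a
# simple traversal of the model:
login_svc = Component("login-service", "login-squad", ["login"])
user_store = Component("user-store", "login-squad")
login_role = Role("login", Recipe([login_svc, user_store]))
project = Project("login-project", [login_role])

owners = {c.owner for r in project.roles for c in r.recipe.components}
```

Note how the discovery name ("login") is attached to the component rather than being the component’s identity, which is the indirection that lets a proxy register the same name in front of an existing service.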
The actual format of the data the sysmodel service reads is free-form YAML, meaning that users are free to add their own service metadata if desired. We decided to expose the sysmodel service publicly within Spotify when we launched it in May 2015. A September 2016 review of what people internally are using it for found no fewer than 18 different use cases, from business rules for server access control (“if you’re a member of a team owning service X, you get login rights to server Y”) to automatically updating all monitoring dashboards for services owned by a particular team. Most of the use cases people found for the data served by sysmodel were things we had not anticipated when building it.
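A consumer of sysmodel data, such as the server-access rule above, might work roughly like this. This is a sketch under stated assumptions: the metadata is shown as already-parsed dictionaries (in reality it would come from the free-form YAML served by sysmodel), and all service, team, and user names are invented.

```python
# Hypothetical access-control rule built on catalog data: a user may log in
# to a server if they belong to the team that owns the service running there.

# Invented stand-in for metadata fetched from the catalog:
catalog = [
    {"id": "service-x", "owner": "team-a"},
    {"id": "service-y", "owner": "team-b"},
]

# Invented stand-in for team membership data:
team_members = {
    "team-a": {"alice", "bob"},
    "team-b": {"carol"},
}

def may_log_in(user: str, service_id: str) -> bool:
    """True if the user is on the team that owns the given service."""
    for entry in catalog:
        if entry["id"] == service_id:
            return user in team_members.get(entry["owner"], set())
    return False  # unknown service: deny by default
```

The point is less the rule itself than that the catalog becomes a single source of truth other tools can build policies on.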
A great thing about the freedom that a microservices architecture gives to teams and individuals is that it allows the decentralized creation of many small components. This improves the speed of experimentation and learning, but leads to an increased volume of software, which in turn makes it hard to understand the software ecosystem: what should be out there, who owns it, and how does it fit into the bigger picture? Can I deprecate it? A microservice catalog like System-Z simplifies this understanding, and also serves as a great base for other tooling that works across the board with your backend systems.
What’s more, people and teams at Spotify are right now creating about 20 new services per week. Most of these are either for learning how to build a backend service or are experiments, and will never make it to production. This number has grown by leaps and bounds as we’ve made changes that make it easier for teams to build and deploy a service to production. We believe that being great at learning quickly is a strategic advantage for us, so this is a pleasing development.
We’ve still got a number of things we want to improve in System-Z. Number one is probably making it (or most of it) open-source. Another is API discovery: being able to go from a use case to a service API that solves it. Despite its shortcomings, System-Z makes many things easier that used to be hard, and makes some things possible that were impossible before.