Modeling Microservices at Spotify with Petter Mahlen
In March, at Jfokus in Sweden, Petter Mahlen, Backend Infrastructure Engineer at Spotify, spoke to a packed house about microservices.
This is a talk about modeling microservices. First, a little bit about me. I’ve been building software since 1995 and at Spotify since 2013, so coming up on three years now. At Spotify, I’ve been working on building various kinds of infrastructure, for instance Nameless, which is our service discovery solution. That’s where, if you spin up a service on some machine, it registers with Nameless and then it’s discoverable so it can take traffic. I’ll talk a little bit about that because it’s coming back later in the talk as well. I’ve also been working on our Dropwizard analog, for those of you who are familiar with that. It’s called Apollo and we open-sourced it in November (I think it was) last year. I have also been working on System-Z, which is the main topic of this talk. Right now I’m doing data infrastructure, so collecting events and things that happen in our various systems and then making them available in Hadoop for processing.
About This Talk
The talk is divided into four parts. First, some background: what I mean by modeling microservices, why you would want to do that, and what problems we’re trying to solve. Then a description of what our solution looks like, then how we designed it, and finally some brief conclusions and what the impact has been so far. I hope that you will learn something about running microservices at scale, whatever that means.
Why Model Microservices?
When I say scale, I like the scale cube from a book called The Art of Scalability. There, they divide up the ways that you scale things along different axes. Typically, the first thing you do is: if you have a single web server and a database, then when the web server can’t handle it anymore, you split it, so you have two clones, two identical copies, and then you can add more and more of those. That’s x-axis scaling: more copies of the same thing. Along the z-axis, you have sharding. When the user database can’t handle the load anymore, you split it up and say: all users whose IDs end with zero go in this one, all users whose IDs end with one in this one, and so on, so you shard it using some sort of algorithm. The y-axis is about splitting a system up into different components, so rather than having copies of the same thing, you say, “Here’s the login thing” and it does one thing and I scale it one way, and, “Here’s the search thing” and that’s something else. You have copies of different kinds of things.
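The three axes can be sketched in a few lines of Python. Everything here (names, shard counts, the modulo schemes) is illustrative, not Spotify code:

```python
# Illustrative sketch of the three scaling axes; all names and numbers are made up.

def pick_clone(replicas, request_id):
    """x-axis: identical clones -- any replica can serve any request."""
    return replicas[request_id % len(replicas)]

def pick_shard(num_shards, user_id):
    """z-axis: shard the data, e.g. by the last digit(s) of the user ID."""
    return user_id % num_shards

# y-axis: functional decomposition -- different components per responsibility,
# each deployed and scaled independently.
services = {"login": "login-service", "search": "search-service"}
```

With three clones, request 7 lands on the second replica (7 mod 3 = 1), and with ten shards a user whose ID ends in 6 always lands on shard 6.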
At Spotify, at least, we don’t do much sharding ourselves. There’s one place I know of where we do that, but generally we leave it to tools like Cassandra and things that manage our data. We don’t do it manually. We do both x- and y-axis scaling, and when you move out on these axes, then even though every single thing you have is simple, the interactions between them can become quite complex, and you get a complex system. To give some numbers, we have something like 10,000 servers in our data centers, and on the y-axis we have about 1100 different things. I’m going to talk more about what I mean by “thing.” It’s not exactly a microservice; it’s slightly different. Another point that’s equally relevant for this talk is that we have about 100 teams that write code, so that’s another aspect of scaling: you scale your development organization. That also leads to a need for the kind of information this talk is about. So I think I should talk a little about how teams work at Spotify.
Teams at Spotify
The reason is that I think the way we’re organized adds to this need for metadata. This is a couple of slides I’ve stolen from another talk, which show that we have squads, as we call them. A squad is a group of 7 +/- 2 people, and they own something end-to-end. For example, there’s a squad that owns the search function at Spotify. That means they own the user interface on all of the clients, so for Android and iOS and whatever. They own the backend services behind it. They own the quality of it, so they do the testing themselves, and they own the operational responsibility for it: end-to-end, full ownership. They also decide what changes are made to the search feature, so they own the product development as well. That’s an example of a feature squad. Besides those, we have other kinds of squads: infrastructure squads that build things like System-Z (that’s built by an infrastructure squad; it’s a tool for the feature squads) and client platform squads, which build other kinds of tools for feature developers.
Spotify also has a history of growing very quickly, doubling the number of developers every year up until a year and a half ago. Now we’re growing again, and we’ve done that basically through cell division: you add people to a squad, it becomes too big, you split it in two, and they get different things to own. The squads are autonomous, as I mentioned. They make up their own minds about what they’re going to be building. That means there’s no, or at least very little, formal or enforced communication between squads. Squads don’t necessarily communicate with other squads about the decisions they make. The good thing about that, obviously, is that you get quick decision making. The bad thing is that you solve the same problem in different ways in different parts of the organization. It’s a trade-off. I think it’s a good one. It does mean that you have some extra complexity when you want to understand what all of the squads do together. There’s a link there to a video that I think is really interesting. If you haven’t watched it I would recommend it; it’s a nice video by Henrik Kniberg talking about these things.
Problems to Solve
The scaling up on the x- and y-axes especially, and the way that we work together, the fact that we have 100 squads that don’t necessarily talk that much, lead to a bunch of problems that we need to solve. They’re very much about understanding and discovering. What is the system that we have? What are these 1100 things? Where are they deployed? If I own a service, do I know it’s running where I think it should be? How do I know? How do I find out? The system as a whole, how does it fit together? If there are 1100 things out there, how do they call each other? What’s the graph or web of interactions? How do I find out more? If I want to call this thing because I want to get data about something, how do I find out how to use it? Maybe I need to understand it, or I want a feature added to it, so ownership is one of the key things that’s hard. Also, if something is broken, what is it, and how do I fix it, or how do I get somebody to fix it? Taken together, these things lead to a need for systems metadata, and that’s what we have in System-Z. That’s the background. I can’t see any of you, so I don’t know if there’s understanding on your faces or just confusion. I’m hoping for understanding, but I don’t know.
What Came Before
[9:20] Right, so in the next section let’s talk about what we’ve done and how it works from the surface. But before that, let’s talk about what we had before System-Z. We had Emil. Emil is the operations director at Spotify, and he labels his reign as the systems metadata system “rumor-driven development.” Basically, when Spotify was smaller, both in the number of teams and the number of microservices, it was possible for one or a couple of people to know: “I want to do this.” “Okay, you need to call this service, and if you want to find out more, talk to this squad.” Eventually, the growth was such that it wasn’t possible for a person to keep up with it anymore. Then we introduced something called ServiceDB, which had this responsibility of understanding: it’s a database about our services, and it should have all the metadata about them. It didn’t receive enough love because there was a lack of ownership of it. I think it was a year and a half ago that we made concrete plans to actually build System-Z, the third-generation metadata system.
The front page looks like that. It’s not that exciting, especially in this big room; if you want, look at the slides later on and see what’s in there. It’s also not very exciting. The name System-Z has a story that I like a little bit. When we set out to build this thing, we didn’t know exactly what it was. It wasn’t ServiceDB; it was something else. Before we knew what it was, we didn’t want to give it a name, so we came up with a really, really poor name that we knew we were going to change later on, and that was System-Z. Here we are. I think it’s perfectly okay. I was one of the people voting in favor of actually keeping System-Z as the name after a while, because we tried to think of something else that was better, more descriptive, but as you will see in the coming slides, it’s doing a very wide set of things and it’s really hard to find an accurate name for it. So, System-Z. These names become proper names; they become a thing of their own, and it’s okay that it’s not a descriptive name. I think it works.
[12:13] Okay, so some terminology, illustrated by this architecture diagram of Nameless, our discovery system. That’s a diagram that I actually drew about two and a half years ago, I think, when we were building it. It shows two of the main things that we have and some of the concepts. I guess it’s not surprising that it fits nicely, because Nameless was one of the things we were modeling the system on. We have components, which are intentionally somewhat unspecified: something we want to track and keep metadata about. The kinds of components we have right now are microservices (obviously the most frequent kind), data stores (also quite common, like a Cassandra database or Postgres or whatever), data pipelines (a job in Hadoop, or a set of jobs, that produce some data), and libraries. A system is a collection of things that belong together. In the diagram, which you probably can’t read, that’s okay, the blue box is Nameless, the system, and it consists of the four yellow boxes and the gray data store. Those are components in the system, and they are of different kinds. The data store is a Cassandra database, the PowerDNS thing in the top left corner is a DNS server, and the other ones are two Java services and a set of command-line tools.
The terms are kind of vague, and a lot of things here are left underspecified, and that’s intentional, because the fact that we have autonomous squads leads to different solutions for the same problems; we don’t enforce standards. People solve problems in slightly different ways, and we want to have a model that can handle that. Plus, when infrastructure technology evolves, it’s not necessarily the case that all of the services out there actually keep up with that evolution, so we have a lot of different versions of things in our ecosystem. Oh yeah, one more thing. You can see two arrows that are incoming here. Those are things that can be called from outside of the Nameless system. That’s a concept that’s coming back a little bit later. Take the Nameless discovery component, the top right one (I don’t know if you can read that). There’s no outside arrow coming into that one; it’s a private component. The PowerDNS server and the Nameless registration component, on the other hand, are public ones, so external things are allowed to call them.
[15:22] I’m going to show you some views, some screenshots of what System-Z looks like. I was thinking about doing a live demo, but I’m too chicken. The main features, I would say, are having somewhere you can find out information about components, and even just list components; dependency tracking, so we can see, for a given component, which other components it calls and which components call it (in some cases we can see that); managing deployment in various ways; and, very importantly, ownership and alerting. Who owns a component is a question that very often needs answering.
[16:10] That’s the page. Again, later on, if you download the slides, you’ll get to see what’s there. It basically shows this: a few tabs at the top (we have now selected the overview tab, but there are some other ones), then some different sections, with the main one in the middle left being just general data, key-value pairs basically. If we zoom in on the bottom right part, that’s an interesting thing: dependency tracking. This particular service has a lot of dependencies, which is stuff that it calls. Things to say about that: you can see that two of them are orange. That means the squad that owns those things has marked them as private, or hasn’t marked them as public; private is the default. This service is still calling them, and it’s owned by a different squad, so that’s a kind of warning saying, “Something is not right here. Maybe those services should actually be marked as public, or you’re doing something wrong.” It raises a flag, and hopefully people will talk and work it out. Maybe you can read it, but probably not: you see there are two words, “declared” and “runtime,” at the right of each component? That means we have more or less sophisticated runtime detection systems that actually track the outgoing calls that services make. Which other services do they call? We collect that data from each instance and then we see what it looks like. We also recommend that people declare, “I’m going to use these services.” Then we can compare. If they’re not using a service, why did they think they would be? And if they’re calling a service they didn’t declare, does that make sense? In this case, it’s all good. All of the services that it’s calling are declared, and they’re also detected at runtime. That’s why all the dots are green; otherwise, there would be orange dots, and mousing over them shows a tooltip saying, “this is wrong.”
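The declared-versus-runtime comparison could be sketched roughly like this; the function and the status names are assumptions for illustration, not the actual System-Z logic:

```python
# Illustrative sketch (not System-Z code) of comparing declared dependencies
# against runtime-detected outgoing calls.

def dependency_status(declared, detected):
    """Map each dependency to 'ok', 'undeclared', or 'unused'."""
    status = {}
    for dep in declared | detected:
        if dep in declared and dep in detected:
            status[dep] = "ok"          # green dot: declared and seen at runtime
        elif dep in detected:
            status[dep] = "undeclared"  # called at runtime but never declared
        else:
            status[dep] = "unused"      # declared but never observed calling
    return status

declared = {"userdb", "search", "playlist"}
detected = {"userdb", "search", "metadata"}
print(dependency_status(declared, detected))
```

A dependency that is both declared and detected gets a green dot; the two mismatch cases are exactly the orange-dot warnings described above.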
[18:30] Based on this information, we have something that creates graph files that are included in the UI. The screenshot that I showed you is old; nowadays it’s right there, like a system map. This is the auto-generated version of the architecture diagram I showed you a couple of slides ago. It shows the Nameless system at the top and some of the components that are in there. It also shows the things that are calling into the Nameless registry, in this particular case.
[19:09] All right, deployment. On the front page, on the overview page, you can see where your service is deployed and what version is deployed, if you provide that information in your build. In this case, there are 76 instances of searchview version 02.74. You can actually click on that link and then you’ll see all the machines where it’s running. And if you’re using Helios, which is our open-source container orchestration system, and you’re at Spotify, then you can manage deployments through System-Z: for each host, you can see what version is running, and then you can update or change what’s running where.
[20:05] We have a concept that we call “pods,” which is like a data center, and if your service isn’t running in all the data centers, then you can change that. You can say: if somebody wants to look it up in data center A, send them to B. If somebody’s looking it up in B, well, there are instances running there, so you’re good. Then there’s the ownership information. You won’t be able to read this either, but basically it shows who owns the component and, hopefully, some contact methods. In this case, there’s a link to Slack; you can click on it and be sent to their Slack channel to ask them a question if you have one. If you have registered a PagerDuty key, that’s the red box (there’s no PagerDuty key here because I’ve removed it), then you can see if there’s currently an alert for that service. If it has problems, you can see that. If your service is failing, you can find out, “Okay, it’s failing because this downstream dependency of mine is not working,” and if that downstream service you don’t own isn’t working, you can click on the bullhorn and wake up the people who own it and say, “You’re broken, you’re breaking me, wake up and fix it.”
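The pod redirection described here, sending lookups for a service to another data center when it has no local instances, might look something like this sketch; the names and the shape of the redirect config are assumptions:

```python
# Hypothetical sketch of per-pod (data-center) lookup redirection. If a service
# has no instances in the pod where the lookup happens, send the client to
# another pod. Config shape and names are illustrative, not Nameless code.

POD_REDIRECTS = {"search": {"pod-a": "pod-b"}}  # service -> {lookup pod -> target pod}

def resolve_pod(service, lookup_pod, instances_by_pod):
    """Return the pod whose instances should serve this lookup."""
    if instances_by_pod.get(lookup_pod):
        return lookup_pod  # instances are running locally: use them
    redirect = POD_REDIRECTS.get(service, {}).get(lookup_pod)
    return redirect if redirect else lookup_pod
```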
System-Z: Like a Swiss Army Knife
[21:25] In addition to this, you can do lots of other stuff. I’m not going to talk about these things; I’m just going to note that System-Z, in the short time it’s existed, has become a bit of a Swiss Army knife. It’s the default place to put features that relate to managing stuff, and I have some suspicion that that may not be great in the long run. It’s not bad right now, but we’ll see. We’ll probably have to do some cleaning up or some sort of restructuring around this as we go along.
[22:02] How did we build this? Let’s first talk about the data model. It’s always funny, I think, when you have things like this: it took us about three months to come up with, and it’s four boxes and some lines. It’s amazing how long you can spend thinking about something that looks super trivial when you’re done. So the core data model looks like this. In addition to that, you heard me talking about things like the PagerDuty key and the Slack channel ID; that’s not reflected here. It’s additional stuff that can be added to the model, but this is the core data model. Let’s talk a little bit about the things that are here. At the top, there’s a system, and a system can obviously contain multiple components. It’s also the case that a component can only belong to one system, the way we model it. Then, for a component, let’s take the down arrow first, to the squad. A squad, very obviously, can own many components. There are 1100 components and 100 squads, so that has to be the case. An interesting thing is that it’s also a many-to-many relationship, because of the way that we grow, by cell division. Sometimes squads can’t make up their minds about who should own component X. Is it them or them?
Well, both of them, until we work it out. That’s an adaptation to the reality we have at Spotify, where we had to do it this way, and it’s working. Then there are two links between component and discover name, and I’m going to talk a little bit more about that. A component can register … Let’s do the other one first. A component can, pretty obviously, call multiple other services, so it’ll depend on multiple discover names. That’s not surprising. It’s also not surprising that, if there is a service, it can be called by many other services, so the depends-on relationship down there is many-to-many. Then there’s the registers relationship. A component can register a discover name, which means: this is a name, and if you look up that name, you’ll find me. A component can have many names, like aliases of some kind. What’s interesting is that that relationship is also many-to-many, so more than one component can be discoverable through the same name. Why is that? This is why.
Migration Plan of User Database and Login Systems Squad
[24:50] It’s maybe a slight detour, but I think this is interesting. I’m going to walk you through the current migration plan that the Zool squad has at Spotify. The Zool squad owns our user database and our login systems, so it’s obviously a very critical piece of infrastructure. If that goes down, then people can’t log in to Spotify, and that’s really bad. We don’t want that to happen. It’s also something with a lot of traffic volume, so there are a lot of requests to it. This diagram shows clients on the one hand, and on the other side there’s user2. Can you see that? Yeah? user2 is the service, and it’s exposing itself by registering two discover names: user2 and login. The reason is that login handles login and user2 handles other user-related things, so the fact that we register them separately means we can scale them differently. Some instances of the user2 service will register only the login discover name, and some others will only register user2. The user2 part is about reading and setting different user attributes: is this a premium subscriber or not, and these kinds of things. What they want to do is get to a state like this: they want to split user2 into three different new services, one handling login, one handling reading and setting attributes, and a create-user service which handles creating new users.
How do they do that without downtime? This is their plan. They start by having user2 register another discover name: user2-legacy. Obviously there’s no incoming traffic for that name, but if there were any, it would work. Then they build a new service called user2-proxy, which just delegates all its requests to user2-legacy and sends the responses back. It’s not getting any traffic because it doesn’t register any discover names yet. They fix that by having user2-proxy register user2 and login. Now we have the situation where two components are registering the same discover names. What does that mean? It means that a client that does a lookup of user2 has a chance (not necessarily 50/50) of either getting a direct reference to user2 and calling that, or going to user2-proxy, which will send it to user2-legacy, which will send it to user2. Either way, things will work: eventually requests are served by user2 and the right responses are sent. The next step is to tell user2 to no longer register those discover names, so all of the traffic goes through the proxy. Then they start building the new services and register them under new discover names, and at some point, when those services are ready enough, they can start having the proxy send requests through to the new discover names. This is extremely useful, because what they can do then is send through only 1% of the requests, or even less than that, so they can validate performance and scalability constraints and make sure that it works. And since they have a common point in user2-proxy, they can actually send requests both ways, wait for both responses, compare them, check that things are as they should be, and only then send the response back.
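The validation step, answering every request from the legacy path while shadowing a small percentage of requests to the new service and comparing the responses, might look roughly like this sketch; all the names are illustrative, not Zool’s actual code:

```python
import random

# Sketch of the shadow-traffic validation described above: always answer from
# the proven legacy path, but send a small fraction of requests to the new
# service as well and record any response mismatches.

def proxy_request(request, call_legacy, call_new, shadow_pct=0.01, mismatches=None):
    legacy_resp = call_legacy(request)
    if random.random() < shadow_pct:
        new_resp = call_new(request)
        if new_resp != legacy_resp and mismatches is not None:
            mismatches.append((request, legacy_resp, new_resp))
    return legacy_resp  # clients always get the legacy answer during validation
```

Once the mismatch list stays empty under real traffic, and the new service keeps up with the shadowed load, the proxy can be flipped to serve from the new path.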
Once they’re happy with the way the new services perform, clients can start making direct calls to them. If they want to, at this stage, they could also turn off forwarding to user2-legacy, once they’re happy that the new services do everything as they should. Or they can wait until there are no more clients that call user2 and login, and this is what we’re seeing now. When they’re there, they can just remove all of that part, and they’re done. No downtime, and the confidence that things are just going to work, because you can validate both that the responses from the new and the old systems are the same and that the new systems can handle the performance requirements. That’s an example of how we utilize the fact that services register themselves with discover names that are different from their own names, and that this can be a many-to-many relationship.
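The core model behind all of this, systems containing components, many-to-many ownership by squads, and many-to-many discover-name registration like user2’s, could be sketched as plain data classes; the field names here are assumptions, not the real System-Z schema:

```python
from dataclasses import dataclass, field

# Minimal sketch of the four-box core model described earlier; field names
# are illustrative, not the actual System-Z schema.

@dataclass
class Component:
    name: str
    system: str                                    # each component belongs to exactly one system
    owners: list = field(default_factory=list)     # squads: many-to-many during "cell division"
    registers: list = field(default_factory=list)  # discover names it serves (many-to-many)
    depends_on: list = field(default_factory=list) # discover names it calls

# user2 registers two discover names, as in the migration example above.
user2 = Component("user2", system="user", owners=["zool"],
                  registers=["user2", "login"], depends_on=["userdb"])
```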
[30:13] Cool, so here’s another slide that you can’t read; I say that with a bit of pride. Obviously, look at it when you download the slides if you’re interested, but the point of this is mostly to show the architecture of System-Z. It’s an Angular user interface with Angular modules for the different components, so it’s modular from the get-go. The yellow boxes represent the teams: different teams own different parts of the UI and different backend services. The green things are UIs, and then there are things like Helios and Nameless, which are backend systems or services. System-Z is, of course, a microservices-based thing itself; it’s like a meta-thing. One thing to note here is the system model component. Right below the big square block, there’s something called “sys-model,” which has a lot of arrows coming into it. That’s the service providing a lot of the data that we’re using: the key-value pairs that show up there, the Slack channel, the PagerDuty key, all of that lives in YAML files in source repositories and is served by the sys-model service. Anything that needs systems metadata to work will call sys-model.
[32:08] We have YAML files, which have very loose, or rather multiple, schemas, I would say. Part of the schema is the core model that I described earlier, with component, system, discover name, and squad (or owner). In addition to that, there are multiple other schemas where different features in System-Z add their own data. For the PagerDuty integration, all it needs is a PagerDuty key. That’s a key-value thing that lives in a section of this YAML file, and so on. The deployment system also has its own configuration that goes in there, and so on.
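A per-component metadata file of the kind described might look something like this; the keys are guesses for illustration, not the actual System-Z schema:

```yaml
# Hypothetical example of a per-component metadata file, kept in the
# component's own source repository. All key names are illustrative.
id: user2
system: user
owner: zool
discover:
  registers: [user2, login]
  depends_on: [userdb]
pagerduty:            # section owned by the PagerDuty integration
  service_key: "<redacted>"
deployment:           # section owned by the deployment feature
  pods: [pod-a, pod-b]
```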
[32:55] The YAML files live with the code; that was a design decision we made. We don’t want to put this into a database that is owned by somebody else; we want it to live together with the code that the team owns, partly because of this: we have a lot of really dirty data in our system, and one reason is that owners don’t benefit from the quality of their metadata directly. It’s the people who need to find out about your service that benefit from it. You don’t benefit from it yourself, so maybe you’re not so motivated. It’s like keeping a comment up-to-date: you change the code, and maybe you update the comments, if you’re well behaved. Maybe not. Plus, there’s the fact that we have a rapidly changing organization: cell division, people moving, ownership moving because the squad that used to own something shouldn’t own it anymore, and so on.
We also have pretty rapidly evolving infrastructure. I mentioned Apollo before, which is our backend framework; we have other versions of that as well, Python-based and other things. That affects the runtime data collection: we collect a lot of data at runtime, but right now we have about 34 different container versions running in production, and they have different capabilities and can return different kinds of data. Some of them cannot provide any at all, some can provide information about outgoing calls, and some can provide it about incoming calls, so we have dirty data; we don’t have all the data everywhere. Also, the rapid growth and rapid changes often lead to people not being very knowledgeable about stuff that they own. That’s another thing that makes it hard to find out information about things. And ownership information might not be updated: a squad might be divided, new squads created, and nobody updates any of the information, so it’s pointing to a dead squad which doesn’t exist anymore. That’s one of the constraints we have in our system design.
Ten Things You Didn’t Know About System-Z That Will Amaze You!
[35:30] Throughout our work with System-Z, we have been doing a lot of things to encourage people to keep this data up-to-date. We have the warnings, which are a kind of gamification, because if you’re a developer, then you want your things to look really clean. If you look at your service in System-Z and there are warnings there, hopefully that creates an itch and makes people fix stuff. We also did this, which I think was really pretty fun. It was printed on A3 paper and taped to all the toilet doors at Spotify: “10 Things You Didn’t Know About System-Z That Will Amaze You.” Migrating is super easy. You can create a new service and use any Jenkins instance, and so on. Number 10: you can give the tools team, that’s the squad that built System-Z, feedback and suggest new features via Slack or email. Amazing stuff.
[36:37] Coming up to the conclusion, the wrap-up of this talk: what’s been the impact? What happened when we did this? It’s hard to know, obviously, but based on statistics that we’re collecting in the user interface, we can see that we have about 200 unique users every week and about 400 every month, out of about 800 people in TPD, which is short for Technology, Product, and Design. That’s maybe a little bit misleading, because some things you have to do through System-Z, so you can’t avoid logging in, but people use it, and we think that’s good. An interesting thing as well: if you remember back to the diagram, we had a bunch of infrastructure squads building tools to support feature squads. These infrastructure squads are autonomous; they build their own things independently. That has meant, before, that they created their own user interfaces and their own admin tools, and then those lived on some URL somewhere and people didn’t know where they were.
The fact that there’s now a common point in System-Z has made it easier for feature squad developers to find the stuff that they need, because they can go there. Another interesting thing we noticed is that it leads to infrastructure teams building related features together. For example, the team that owns hardware and provisioning, the team that owns DNS infrastructure, and the team that owns System-Z have together built an improved version of how you do provisioning and deployment. Teams that own related features start talking about how to build them, making the UIs more consistent and the features better for the end users. Every year since 2014, we’ve been doing a survey among the developers in New York. That’s about 150 developers at Spotify in the New York office. They don’t have as much infrastructure there; most of our infrastructure teams are here in Stockholm. It’s a yearly survey about what sucks about Spotify’s infrastructure. In 2015, ServiceDB, the predecessor, showed up as, I think, number 3 in what sucks. Whereas now, in 2016, System-Z is one of the things that most frequently gets an “It’s great!”, even though we’re asking what sucks. That’s good. It’s also become a Swiss Army knife; it’s doing a lot of different things, and I think the jury’s still out on whether that’s a good thing or not. It’s becoming the default place to go when you want to create a tool or feature for backend developers. I should also say that it was only a week or two ago that the people building components on the client side, the different apps and tools on the client side, also started thinking about using System-Z to track their stuff, so it might be that we have all of Spotify tracking things in the same system at some point. Maybe.
[40:08] To conclude: if you have microservices at scale, whatever scale means, it means you have a lot of small things. Each one is easy to understand individually, but the combination is hard. How does it work? What’s the big picture? What are even the things that you have? Do you have a list of them somewhere? If you have metadata, that helps you understand the system. At Spotify, at least, we have dirty metadata. I don’t have enough information to say that that’s a fact of nature, but I think it is; I think it’s very likely to be the case almost anywhere. And this may be obvious, but it wasn’t obvious at Spotify: the fact that we in the infrastructure squads are providing a single user interface for all of our customers is actually helping us make better tools, and it helps our users because they get more consistent tools. All right. That’s all I had. Do you have any questions?
[41:34] Audience: Hi. How do you collect runtime information about dependencies?
PM: I mentioned Apollo, that’s our framework for building backend services, and to make a fairly long story short, Apollo tracks that information. When you make a call, it registers the discovery name that you’re calling, and then there’s an endpoint that you can query on every service which will tell you, “I have made calls to these services, or these discovery names.” It also includes other information, such as the current configuration, the version number if you provided it in your build, and things like that. More questions?
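The mechanism described in the answer above can be sketched roughly as follows. This is a minimal illustration of the idea, not Apollo’s actual implementation: the class and endpoint names here are invented, and Apollo does this inside its RPC client rather than by manual calls.

```python
# Hypothetical sketch of runtime dependency tracking: every outgoing
# call records the target's discovery name, and a metadata endpoint
# reports the accumulated set so a tool like System-Z can poll it and
# build the dependency graph. Names are invented for illustration.

class CallTracker:
    def __init__(self, service_name, version):
        self.service_name = service_name
        self.version = version
        self.outgoing = set()  # discovery names this service has called

    def record_call(self, discovery_name):
        # In a real framework, the RPC client wrapper would call this
        # automatically on every outgoing request.
        self.outgoing.add(discovery_name)

    def metadata(self):
        # Payload that would be served from a well-known endpoint
        # (e.g. something like /_meta/calls) on every service instance.
        return {
            "service": self.service_name,
            "version": self.version,
            "outgoing-calls": sorted(self.outgoing),
        }

tracker = CallTracker("playlist-service", "1.4.2")
tracker.record_call("user-service")
tracker.record_call("metadata-service")
print(tracker.metadata()["outgoing-calls"])
# ['metadata-service', 'user-service']
```

Because each service only reports its own outgoing calls, the central tool can assemble the full system graph simply by polling this endpoint on every registered instance.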
[42:32] Audience: There is some overlap in what you have done with vendors, for example, the cloud providers and also the Netflix OSS stack. Could you elaborate on the key features that you needed that were not provided elsewhere? Apart from culture.
PM: I think the volume is a little bit low. You’re asking me to elaborate on the key features that we…
Audience: That you felt that you needed that were not provided for example by the Netflix OSS stack.
PM: Why didn’t we pick an out-of-the-box solution for this? I think the answer is mostly that System-Z itself is very much not a candidate for open source or anything like that. Everything underneath it is very much Spotify-specific infrastructure, so I don’t know that we gave any very serious consideration to whether we could fit some external tools. Did we? Not really? Yeah. Basically, we didn’t think it would be feasible to retrofit an external tool, because there’s so much custom stuff. I said we started building this a year and a half ago, but that’s the user interface part. The things about registering metadata and tracking configurations have been in existence for a long time. We already had all this stuff out there that we knew we wanted to track, so it felt like the easier thing to do was to collect these things and make them available in the UI. Any more questions?
[44:30] Audience: Yeah, so what’s the size you would say, in terms of people, at which point you would consider switching to a microservice architecture? I would say if you start and you’re a really small team, then microservices are a big overhead in some ways, so since you scaled up, what kind of size did you reach?
PM: Good question. I don’t know that I have a really good answer. I think you’re onto something when you say it’s very much about the size of the team. If you look at the scale cube, it only has those three dimensions, which are about scaling your hardware and software, the system itself. I think one of the big benefits of a microservice architecture is that it’s much easier to parallelize development of the services, and you can do things like deploy them individually and so on, which is harder for us when looking at the clients. The clients are not microservices-based; they are a bunch of teams that have to combine their effort and put it into one monolith. If I were to guess: five teams, six, that might be the inflection point where it starts paying off. Any more questions? No? All right, then thank you so much.
About the Speaker
Currently building infrastructure and developer tools at Spotify, Petter has 20 years of experience in software development in many roles: developer, project manager, product owner, CTO, architect, etc. The last few years, he has eschewed management roles for his love of coding and has worked mostly on building large-scale distributed systems for the web.
Published at DZone with permission of Carissa Karcher, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.