Back in May, nearForm assembled an all-star lineup of speakers for Microservices Day London. One featured Adrian Trenaman, Senior Vice President of Engineering at Gilt, who discussed growing into a microservices architecture.
Fighting the Good Fight at the Hot Gates of Microservices
Introduction to Gilt
As I said, I think I’ve been sort of living the dream that Fred talked about so let me go from the general stuff to the specific, right? I’m going to talk about microservices at Gilt. We learned a huge amount, and in a sense, there’s a whole ton of war stories and things that I hope that you can take away from what we’ve learned about microservices. The analogy: I needed a sexy title for my talk. The analogy comes from the fact that effectively the microservices that run gilt.com, there’s approximately 300 of them. That’s the initial sort of analogy to the Hot Gates, to Thermopylae. The interesting thing there is, so what about the Persian horde? What is it that these 300 microservices are protecting us against, or defending us against?
It turns out to understand that, you need to know a little bit about Gilt’s business model. Now, it’s interesting. If you’re in America, everyone knows who Gilt is. It’s like a household name. Very much less so in Europe. To give you an idea of what it is, effectively the business model is a flash sales model. It started in 2007. It’s been pretty interesting in that its first year it made a million dollars. Four years later, its revenue was 400 million dollars. When we were recently acquired, at that stage our revenue was approximately 700 million dollars. Pretty big organization.
What’s the idea? Well, we get our hands on designer brands, luxury fashion goods. Typically, we get them at discounted prices because perhaps it’s last season’s stuff, or it’s overstock, or whatever, but the focus is always on really high quality lux brands. We get the product in and one of the big things that’s really important is making the product look absolutely fabulous. We actually shoot all the product in our own studios. We have an entire creative division. The product comes in, the merchandise comes in and we photograph that. We are a real company so we’re actually selling real stuff. We have our own warehouses and we have our Kiva robots, and so we deliver product.
Ultimately the reason why I’m telling you all this is because this is a very unique business model. There is no off-the-shelf tool that you can buy to run a flash sale business. Effectively, all of the software that runs these parts of the business, they’ve all actually been written in house by our teams.
A key part of the process is that we sell every day at noon. We tell all of our members, we say, “Hey guys, today we’ve got Louis Vuitton. It’s going to be live at 12:00, come get it. First come, first served. We have low quantities of highly desirable products available and they’re going on sale at noon.” Then this is what our customers look like. Effectively you do end up with this horde, every day right at noon. This is, in a sense, the engineering problem that is really interesting and exciting for Gilt. I think it was very unique for us, as well, from a microservices perspective because as we moved towards a microservices architecture, we didn’t know whether we would be able to handle this kind of load.
[3:32] What our customers really look like from an engineering perspective is this. The big, big spike that you see at 4:00 p.m. GMT, which is noon in the States, that’s effectively our traffic load on one of our load balancers. Handling this pulse load has been one of the unique engineering problems that we’ve learned to solve at Gilt. We’ve learned it, in a sense, the hard way. As our customer base and our business went viral, effectively we had outages. It’s interesting. It’s like, how do you engineer?
How do you build everything so that it’s super scalable?
We started with a world that was all Ruby and that was good. Then, we got to the point where Ruby simply couldn’t handle it. We actually had some great Ruby engineers working on the site, but the reality was it simply couldn’t do it. We adopted a technology called Java. Effectively, over time, we were able to use this technology to get the kind of performance that we were looking for. This is a very interesting diagram, as well, just from a failure perspective. As you can imagine, our revenue at any given time of the day actually tracks this line as well. Failure is not linear. If we fail at these times of the day, that’s okay, but if we fail around this time, just for a couple of hours after noon, that’s actually of course really damaging. It’s material to our business. Being reliable, being scalable is just massively important.
We think about the hype cycle, adopting microservices, and this is the classic kind of found-it-on-Google hype cycle curve. Technology trigger, something happens, everyone gets excited. Everyone’s going, “Oh my God, this is amazing,” and then all of a sudden, after the peak, you get to this trough of disillusionment. The trough of disillusionment is where you go, “Oh my God. This is dreadful. It’s not what we thought it was.” If you’re lucky you can get through the trough to the slope of enlightenment, and then hopefully the plateau of productivity. Gilt is now here with respect to microservices.
This is effectively how it all panned out:
“Oh, no. We’ve got a monolith. It’s terrible. What do we do?” Maybe these microservices can help, so we’re very, very excited. We’re kind of like, “How do we adopt this new architecture to try and actually go faster?”
“Yay. We’re doing great. This is amazing.”
Then in 2013, we’re like, “We are at the top of the mountain. This is brilliant. We have 300 services.” The interesting point there is, as well, as part of that thing, this is a chart of the number of services over time on Gilt.com. Effectively, you do see a point of inflection around here, which is January 2012. It’s a very interesting point. That was the point when we officially decided we were a Scala house. We’d gone from Ruby, to Java, we landed on Scala and we absolutely loved it. In a sense that adoption of a new language, a new functional programming language and everything that gave us, that’s what kind of led to the point of inflection there. We were at the top of the mountain.
[6:59] Holy crap. Look at all these services. It’s a hard world to live in when you’ve got 300 things out there to monitor, to manage, to understand. There really was a moment where you’re going, “This is actually hard.” There are a whole ton of problems that we didn’t know we were going to have. One of the big ones, and it’s still one that we’re kind of coming close to the end of actually, is a staging environment. When you have one big service, it’s easy to create staging environments, or pre-prod environments. When you’ve got 300 of them, trying to replicate those 300 services into a staging environment is next to impossible. Believe me, we’ve tried it. We’ve thrown ourselves at it. In the end, the interesting thing there is we’ve learned basically to stop using staging environments and simply test in production, and figure out, and I mean that in all seriousness, how do you test safely in production? It’s an incredible skill and technique to learn.
We kind of say, “Let’s get a handle on this. How do we look at what we’ve created and continue to innovate at the pace that we’ve liked?” Now this is effectively where we are now, but I think probably by January or February was this weird sensation of, “Wow. We actually know how to do our job,” which is kind of great. Then it all changes.
Lessons from the Slope
[8:34] What I want to talk about. Lessons from the slope, that’s the slope on the way to the plateau. First thing is that microservices is very much an emergent architecture. Fred alluded to developer anarchy. Effectively, when we went to microservices we completely decentralized all of our decision making. You’re probably used to, in many organizations, the idea that you’ve got a brainiac chief architect who tells everybody what to do, and everybody bows and goes and does it. We don’t have that. We’ve completely moved all of the decision making on architecture, and on technology choice, to the engineers, and to the teams themselves. When you do that then, you do end up with a big pool of stuff. Then you have to figure out, how do I take that pool of stuff and understand it? How do I think about it? I’ll talk about that.
I’m going to talk about how to manage ownership and risk. Managing ownership, it was a critical problem, and interestingly, this was a bottom-up problem. Our engineers were unhappy that they didn’t know who owned what. Who is responsible for this service? “I may have written or committed to it one day, but it doesn’t mean I should be the guy on call.” How do we figure out the ownership piece? Then a very important technical note about making your clients thin, and then I’ll talk a little bit about avoiding snowflakes.
It is incredibly hard to think of architecture as just a list. Literally, we became aware that effectively we had approximately 265 services. Here’s our architecture. Bang. That’s a list. How do you reason about that? I mean, it’s just like I don’t even know where to start. The average human mind can hold 7 plus or minus 2 things in its head at any moment. Here we are with close to 300. Just impossible to reason about it. We looked and we said, “How do we … ?” I was actually very fortunate here. I’d actually gone to Tokyo for 2 years, and I came back and I was faced with this problem. How do I understand these services?
The Gilt Genome Project
We used a technology called a spreadsheet. Now, it’s actually very interesting because we thought that we have 265 services, and somebody said, “I know, let’s build a service to track our services.” At that point we knew there was something not quite right. Instead, it turns out spreadsheets are great: collaboration built in, they’re Turing complete, you can practically do anything you want with them. We actually took the brave move of simply saying, “Let’s build a list. What are the services that we have and who owns them?” This was actually the first step towards going, “Okay actually. We’re getting a sort of a handle on this.”
We then looked and said, “It’s interesting. Why don’t we add a little bit of classification to this so that we can understand more about our systems,” and we just said, “Let’s play with a 3-level taxonomy.” Main functional area of the business that this service is relating to, the system that it’s involved in, and then a subsystem. Then it turns out that by just going through the list and saying, “Where do these things belong?” We actually inferred an emergent architecture. That was actually a really, really nice thing.
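As a rough illustration of the 3-level taxonomy described here, the sketch below models each row of such a list as a small record and groups the rows into an area, system, subsystem tree. The talk used a spreadsheet, not code, so the `ServiceEntry` type, the `infer_architecture` function, and every sample row are hypothetical and exist only to show the shape of the idea.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEntry:
    """One row of a hypothetical service list."""
    name: str
    owner: str       # team that owns the service
    area: str        # level 1: main functional area of the business
    system: str      # level 2: system the service is involved in
    subsystem: str   # level 3: subsystem within that system


def infer_architecture(entries):
    """Group services as area -> system -> subsystem -> [service names]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for entry in entries:
        tree[entry.area][entry.system][entry.subsystem].append(entry.name)
    return tree


# Invented sample rows, for illustration only.
ENTRIES = [
    ServiceEntry("svc-login", "team-a", "Back Office", "Security", "Authentication"),
    ServiceEntry("svc-invoice", "team-b", "Back Office", "Finance", "Billing"),
    ServiceEntry("svc-order-intake", "team-c", "Back Office", "Orders", "Order Processing"),
    ServiceEntry("svc-order-router", "team-c", "Back Office", "Orders", "Order Processing"),
]

if __name__ == "__main__":
    for area, systems in infer_architecture(ENTRIES).items():
        print(area)
        for system, subsystems in systems.items():
            print("  " + system)
            for subsystem, names in subsystems.items():
                print("    " + subsystem + ": " + ", ".join(names))
```

Nothing in the grouping is designed up front: the tree is simply whatever falls out of the labels people put on each row, which is the sense in which the architecture is inferred instead of dictated.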
Some things that we learned
Some services are incredibly simple. Other services are actually deceptive in that you think they’re simple, but actually they’re not. What is the evidence to support that the services are simple? This is a logarithmic graph here of the size of our services, the number of lines of code, and it turns out most of our repos have approximately 2,048 lines of code. That’s all the code. That’s the build code, the Java code, the model code, the Scala, whatever. It’s not as small as some organizations, but it’s actually reasonably good.
This is great. You can drop into a repo and you very quickly understand what the hell the service is doing. That’s really nice. As well, in terms of the number of files in any repo for any service, we’ve got typically less than 32 files. Again, that’s everything. Again, that’s actually pretty good. Our services are generally quite small. There are some exceptions. What became really interesting though in terms of the taxonomy approach was this: what we did was effectively a pivot table, would you believe it. We realized, “Hey, there’s only one service here that does authentication on the back office systems.” This is the overall back office architecture here. That’s kind of nice. There’s one for billing, that’s great. You look through, and when the counts get high, 9, or down here, order processing, 8, this is where you realize that actually these services are kind of deceptive. On their own you can’t understand them. Any particular service in the order processing suite, or cluster, only makes sense when considered in tandem with the other services that it works with.
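To make the idea of a logarithmic size chart concrete, here is a minimal sketch that sorts repos into buckets whose upper bounds are powers of two. The power-of-two bucketing is an assumption suggested by the figures quoted (2,048 lines, 32 files); the function names and the repo sizes are invented for illustration and are not Gilt's data.

```python
from collections import Counter


def power_of_two_bucket(lines_of_code):
    """Return the smallest power of two that is >= lines_of_code."""
    if lines_of_code < 1:
        raise ValueError("a repo must have at least one line")
    return 1 << (lines_of_code - 1).bit_length()


def histogram(repo_sizes):
    """Count repos per bucket, keyed by each bucket's upper bound."""
    return Counter(power_of_two_bucket(n) for n in repo_sizes.values())


# Invented repo sizes (total lines of code per repo).
REPO_SIZES = {
    "svc-a": 900,
    "svc-b": 1500,
    "svc-c": 2048,
    "svc-d": 1800,
    "svc-e": 30000,
}

if __name__ == "__main__":
    for upper_bound, count in sorted(histogram(REPO_SIZES).items()):
        print(f"<= {upper_bound:>6} lines: {'#' * count}")
```

On a scale like this, one very large repo sits only a few buckets away from the typical one, which makes the exceptions easy to spot without flattening the rest of the chart.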
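The pivot-table step can be sketched in a few lines: count services per subsystem, then flag the subsystems whose count is high enough that no single repo tells the whole story. The counts for authentication, billing, and order processing echo the ones mentioned in the talk; the service names, the function names, and the threshold of 5 are assumptions made up for this example.

```python
from collections import Counter


def pivot_by_subsystem(rows):
    """Count services per subsystem from (service name, subsystem) pairs."""
    return Counter(subsystem for _name, subsystem in rows)


def clusters_needing_docs(counts, threshold=5):
    """Subsystems with at least `threshold` services (threshold is hypothetical)."""
    return sorted(name for name, count in counts.items() if count >= threshold)


# Invented service names; only the per-subsystem counts follow the talk.
ROWS = (
    [("svc-auth", "Authentication"), ("svc-billing", "Billing")]
    + [(f"svc-order-{i}", "Order Processing") for i in range(8)]
)

if __name__ == "__main__":
    counts = pivot_by_subsystem(ROWS)
    for subsystem, count in counts.most_common():
        print(f"{subsystem:<18} {count}")
    print("needs cluster-level documentation:", clusters_needing_docs(counts))
```

A count of 1 suggests the repo can be read in isolation; a high count is a hint that the behavior lives in the interactions between services and deserves documentation at the cluster level.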
Effectively, when you do microservices, you kind of explode out the complexity. You don’t get rid of it, you just move it around a little bit. What you’ve got to hope is that you’ve moved that complexity into a constellation of services that makes it easier to understand, but one of our things was, “Okay, we need to make sure then that we have the right people.” In a sense we do need to do a little bit of architectural thinking and documentation around some of these areas because otherwise going into one repository, you’re not going to get the whole picture. It was interesting. Very, very simple approach. Apply taxonomy, see what happens, and then effectively we’re going to get some kind of emergent architecture.
Manage ownership and risk
[14:26] One of our things as well, as we talk about architecture and exploring the way we architect: we’ve come upon this organizational unit called the department, which actually has solved an enormous amount of our problems. Fred talked about organizations and how you structure yourselves. I think it is as critical to your success as the technology choice is. What we’ve learned to do is stop making decisions at this grand level, push it down, but don’t go down to, like, 100 engineers. If you push the decision making to each engineer, then you’re going to end up with 100 different solutions. How do you get the right level of consensus? We get that at the department.
[15:09] Let me talk about ownership and then I’ll round out on my ideas for the department. What we said is, based on this list, this genome, we said, “Okay, who owns the software?” We put in these departments. Departments are approximately 16 to 20 people in size, and all engineers. There’s a director. The director is on the line. He owns his estate. Maybe he’s got 50 or 60 services. It’s up to him or her to make sure that this is the way they want it to be. They have full accountability for what they have. Then the director kind of assigns the services, the landscape, to the various teams so, “Team Rubris, you own these 12 things. You must make these great.” We’ve kind of distributed the problem there.
One of the other things then is the teams, the software teams themselves, they’re responsible for building and running their services. Effectively, what we’ve done at Gilt is we’ve built in pure DevOps. We’ve moved everything to the cloud. Ultimately the teams build and run their software. This is different from Google’s SRE model where one team is a product team that builds something and they hand it over to a different team to run it. We’ve landed on it’s really important for the teams to run and own their own stuff in production. We basically end up with a RACI-style kind of structure here that turns out to be incredibly powerful. Teams are responsible for building and running their services. They get given KPIs. “Solve this problem, we want to increase the number of clicks on this page.” They go like crazy at it and effectively they’re responsible for owning and running the services they use to solve the problem.
Directors are accountable. This is kind of a level of leadership like over a team of teams. What we found, which was really great, was we built an entirely powerless entity called the Arc Board. High influence, but 0 power, made up of people from all the departments who are senior engineers to try and figure out what kind of consensus we need at the high level. It’s really great. The Architecture Board makes amazing recommendations, but you can completely ignore them. You can do whatever you like. The key thing is that if you do choose to ignore the Architecture Board, you’re on your own and that’s the price of doing that. What we’ve actually found is having this sort of virtual Architecture Board has really worked for us. I mean it’s one of my favorite kind of meetings of the week is the Architecture Board, just the conversations are incredible.
How do you create an Architecture Board?
[17:50] Effectively we open-sourced our constitution. What’s actually really nice then, when you want to change the constitution or the makeup of the Arc Board, you submit a pull request. It’s great. Likewise, what these guys come up with, their standards and recommendations, as we discuss what these standards and recommendations are, we actually use a ThoughtWorks-style tech radar. Effectively again, these are all tracked in a GitHub repository. These 2 are both public so you’re more than welcome to take a look at those.
Interesting numbers, 5 plus or minus 2, perfect size for a team. We do not need engineering teams larger than this. I think that’s kind of a really important learning that we have found. And interestingly as well, when you’ve got a team like this, we do have team leads. Team leads code about 85, 90% of their time. They are not a manager. We just don’t have management positions. 20 plus or minus 4, this is actually something similar to the size, or the optimal size for a classroom apparently. This is the perfect size for a department. You can get these people into a room, and you can say, “Guys, here’s what we need to do,” and actually within a department you can have mobility between teams and people feel good about it. They don’t feel too shuffled. This actually works out really, really well for us.
30%, this is the amount of time that we try and get for the split between your operations work as a team, because we’ve kind of embedded ownership, and ownership of production into the teams. What we’ve landed on is about 30% of the time is spent on routine maintenance and operational stuff. We try and get to a point where like 70%, sometimes 60, it’ll change per team, is actually spent on new interesting stuff.
[19:42] This has also been one of the other very, very interesting things that we’ve done. We’ve said, you’ve got all these services. Then you’ve got all these engineers over here. There are fewer engineers than services, and many of the people who’ve written the services have moved on to new jobs. How do you handle the risk? What we did was we took every service and we classified each service by simply saying, “Is it active?” Green is active: this service is under active development, we absolutely love it. Passive is we’re not really developing on this, but it looks pretty good and we’re okay with it, we feel good about it. Then red is at risk. At risk basically means somebody told me to own this service. “I’ll get it up and running for you, but realistically I’m not offering any guarantees or any SLAs.”
We’ve formed these donuts to track our ownership and it turned out, this is really brilliant, over time. July, you see a black line there, unclassified. You get an idea. A quarter of the code base is under active development. 43.9% is passive, and this was like a real red alert for us: holy crap, there’s almost a third of our stack that we don’t feel good about. That was a real eye opener. We set out then to weave in our ownership needs, and de-risking this matrix here, with our attack strategy. September, we’ve gotten rid of the black ones, we have a better idea. October of 2015, we’ve moved our unknowns to 24.0%, and this is just amazing, February of this year we got down to 18.4%. This is a great way of tracking and understanding the level of risk you have. It’s actually interesting, when you do microservices, you can build something like this. When you’ve just got a monolith, it’s just one thing that’s either red, or green, or orange.
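The numbers behind a donut chart like this are simple to derive once every service carries a status. The sketch below assumes a plain mapping from service name to status; the `Status` enum, the `risk_breakdown` function, and the 20 sample services are all invented for illustration and do not reproduce Gilt's figures.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    ACTIVE = "active"              # under active development
    PASSIVE = "passive"            # not being developed, but the owners are comfortable
    AT_RISK = "at risk"            # owned in name only, no guarantees or SLAs
    UNCLASSIFIED = "unclassified"  # nobody has looked at it yet


def risk_breakdown(statuses):
    """Return the percentage of services in each status, to one decimal place."""
    total = len(statuses)
    if total == 0:
        return {}
    counts = Counter(statuses.values())
    return {s: round(100.0 * counts[s] / total, 1) for s in Status}


# Invented snapshot: 5 active, 9 passive, 6 at risk.
SNAPSHOT = {}
SNAPSHOT.update({f"active-{i}": Status.ACTIVE for i in range(5)})
SNAPSHOT.update({f"passive-{i}": Status.PASSIVE for i in range(9)})
SNAPSHOT.update({f"risky-{i}": Status.AT_RISK for i in range(6)})

if __name__ == "__main__":
    for status, percent in risk_breakdown(SNAPSHOT).items():
        print(f"{status.value:<13} {percent:5.1f}%")
```

Taking the same snapshot every month or quarter and comparing the at-risk and unclassified percentages gives the kind of trend line the talk describes.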
[21:42] It turns out, anyway, that we thought about the architectural areas, these are the main areas of Gilt Tech. We built departments around these areas around ownership. We’ve lots of flexibility within the departments in terms of team structure. Somebody in a meeting once said, “You’ve just pulled an Inverse Conway manoeuvre.” It turns out I think we have. Effectively we’ve modeled the organization after the architecture that we wanted to achieve. That actually felt really, really good.
Make your clients thin
[22:13] A couple of quick notes before I get kicked off. Thin clients: one of the things that we did, which was wrong, was that when we wrote a service, every service, every engineer would write the client code and make a JAR that people could use to consume the service. Under that client code, they put all sorts of goodies and all sorts of stuff. Consumers would use the client JAR and then away they’d go. This is hell. The reason why is because the client code implicitly pulls in the service’s dependencies. The consumer has their own consumer dependencies and that’s when you get clashes in terms of dependency management and versions. Now this might sound like an isolated problem, just suck it up and deal with it. When you’ve got 300 services that you’ve got to upgrade to a new version of a JAR, and you have to stop your entire tech team for 2 months to try and figure out how we’re going to get this massive upgrade done, that’s just not great. This was a mistake.
What you really need to do is something like this. We’ve adopted apidoc, which is really good. You define your service in a technology agnostic language, and then you basically generate client code. The trick is that when you generate code, this code should have no dependencies whatsoever. Effectively, by doing this, by having these client codes be zero dependency, it’s absolutely decoupled everything for us in terms of that kind of risk. I would say, keep your clients absolutely as thin as possible when you do this.
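To show what a thin, zero-dependency client can look like, here is a hand-written sketch that relies only on the Python standard library. It is not apidoc output and does not use any apidoc API; the `UserClient` class, the `/users/{id}` route, and the `User` fields are hypothetical stand-ins for whatever a service definition would describe.

```python
import json
import urllib.request
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class User:
    """Plain data model; carries no behavior and no third-party types."""
    id: str
    email: str


def _http_get(url):
    """Default transport: fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


class UserClient:
    """Thin client for a hypothetical user service."""

    def __init__(self, base_url, http_get=_http_get):
        self._base_url = base_url.rstrip("/")
        # The transport is injectable so consumers can swap in their own
        # HTTP stack without the client depending on it.
        self._http_get = http_get

    def get_user(self, user_id):
        """Fetch one user and map the JSON body onto the User model."""
        url = f"{self._base_url}/users/{quote(user_id, safe='')}"
        data = json.loads(self._http_get(url))
        return User(id=data["id"], email=data["email"])
```

Because the client imports nothing outside the standard library, a consumer that pulls it in inherits no transitive dependencies, so there is nothing to clash with the consumer's own versions.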
[23:47] Last point, and I’ll be really quick here, is on the building of snowflakes. Effectively, we did a meetup, an AWS meetup, and it turns out that we have, I think, 7 different ways to deploy code to production. That’s just crazy. The complexity of managing 7 different pipelines. Effectively what we had done was every team reinvented their own solution for how to deploy their services to Amazon, and some of these solutions actually involved quite a lot of code to actually figure out the deployment pipeline. You have to ask yourself, “Really? Did this code help us sell more dresses? Should we have been doing this?”
Andre’s Rule of 6 now applies
Effectively, if you wait long enough, typically 6 months, Amazon will have implemented the solution that you’re looking for. You’re actually better to put up with it and then effectively, 6 months later, you’ll have a solution. It turns out now, actually in terms of deployment, we discovered, “Hey, this is actually really interesting.” We are now basically using as much of the Amazon tooling as possible, and we need the lightest kind of layer over this in terms of a small Python framework to layer it all together. We call that Nova and that’s also just recently been open-sourced here. If anyone’s interested in that, if you’re in an AWS world, I think you should have a look at that. It uses all the goodness you’d expect, Docker, and all the great stuff. What it actually supports in terms of rollouts is amazing. It’s things like having dark canaries and canary releases, that’s what lets us test in production. This stuff is really, really great.
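The core of a canary release is sending a small, stable slice of traffic to the new version while everyone else stays on the old one. The sketch below shows one common way to pick that slice, by hashing a request or user id into a bucket. It is a generic illustration, not Nova's implementation or API; the function names and the `"canary"` and `"stable"` labels are assumptions.

```python
import hashlib


def canary_bucket(request_id):
    """Map an id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(request_id, canary_percent):
    """Route roughly canary_percent percent of ids to the canary build."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if canary_bucket(request_id) < canary_percent else "stable"


if __name__ == "__main__":
    ids = [f"user-{i}" for i in range(1000)]
    for percent in (0, 5, 25, 100):
        hits = sum(choose_version(i, percent) == "canary" for i in ids)
        print(f"{percent:>3}% target -> {hits} of {len(ids)} ids on canary")
```

Hashing instead of choosing at random keeps each id on the same version for the length of the rollout, and raising the percentage step by step only ever adds ids to the canary group.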
Thank you so much.