DevOps at Nike: There Is No Finish Line
For an example of a digital transformation and DevOps adoption, look no further than Nike in this presentation from the 2017 DevOps Enterprise Summit.
The following is an excerpt from a presentation by Ron Forrester and Scott Boecker from Nike, titled “DevOps at Nike: There is No Finish Line.”
You can watch the video of the presentation, which was originally delivered at the 2017 DevOps Enterprise Summit in San Francisco.
At Nike, we are in the middle of a digital transformation, so we want to spend a little time today talking about where we’ve come from, where we are now, and give you a little taste of where we’re headed.
Our mission has been to bring inspiration and innovation to every athlete in the world. But that’s with a little asterisk because, at Nike, we define an athlete as anyone with a body. If you have a body, you’re an athlete. (So it expands our consumer base very well.)
But, as you all know, on and off the pitch, the field, the court, our world is evolving and changing at a pace that we’ve never seen before.
We are communicating constantly in new ways. Our expectations for personalized service have never been greater. We pay for goods with our phone, we pay our friends back without cash. Commerce happens just about anywhere, everywhere, and at any time, and the technology to support the Internet of Things is changing to support all of this.
These trends were apparent in both 2014 and 2015, and we knew that in order to keep up with our athletes’ needs, we had to change how we built and operated digital experiences. That change started with our team.
First, we focused on creating a very tight partnership between product and engineering, with a shared vision for exceeding our consumer expectations.
Then we looked for a business problem that directly impacted our consumers and our business, and that would also prove how a new platform and a new way of working could bring scale and leverage to the business globally.
We Did Not Have to Look Far
You might not be aware, but, basically, on any given Saturday at 7:00 AM Pacific, we often release new shoes. They’re very highly coveted, with hundreds of thousands, if not millions, of consumers simultaneously trying to cop the newest Jordans, Dunks, Kobes, or LeBrons.
My prior history was at Ticketmaster, so not unlike the new U2 concert going on sale, we have a very, very similar challenge in how we scale and how we serve that demand that hits all simultaneously.
The experience that used to happen on a given Saturday, when all of these people hit, you’d come in super excited, on time, find your shoe, go into the process, and this is what you’d get:
In order to manage that, we were queuing and throttling everyone at the edge. That let us manage the consumer experience: how people would come through, get the inventory, and go out.
But, if you bounced, or if you left this, you were out of line. Imagine being on the phone, sitting there waiting, and you get a call. If you take the call, you’re out of the line.
In moments of high demand, with a large consumer base showing up, some of these lines would last anywhere from two to three hours. We would literally have people waiting in a line like this for hours.
Unfortunately, because the demand for the shoes far outweighs the supply, the result was often this.
As you can imagine, this is not an ideal consumer experience. This does not leave people feeling the warm and fuzzy success of getting that shoe they needed. We knew we had a challenge and a problem that we had to solve for our consumers, for our brand, and fundamentally for the business to be able to meet the needs that were clearly there.
We Pulled Together With Product and Technology
We created shared principles and we created our own hierarchy of needs.
We said, in order for this to succeed, it fundamentally needs to be:
- Reliable, secure, and stable foundation: If the site’s not working, if the app’s not working, everything else is basically irrelevant.
- We wanted it to be fair. Fairness is a really large challenge when you’re talking about a minimal number of units for a lot of people. So there was a process by which we would understand the selection process and the different ways to buy depending on the inventory. We created unique lottery systems, and we created unique first-in, first-out lines, based on what we had.
- It’s got to be fast. When we were developing here, we wanted to develop a set of services that would be utilized on the web, in an iOS app, and in an Android app simultaneously. So they had to be fast for people on different connections and with different devices.
Once you get all that right, you get to make it fun. That was the start of how we set this up.
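The two selection styles mentioned in the fairness principle, a lottery and a first-in, first-out line, can be contrasted in a few lines. This is only a minimal sketch, not Nike's implementation: the function names, the entrant names, and the `seed` parameter are assumptions made for illustration.

```python
import random
from typing import List, Optional


def allocate_fifo(requests: List[str], units: int) -> List[str]:
    """First in, first out: the earliest arrivals get the inventory."""
    return requests[:units]


def allocate_lottery(requests: List[str], units: int,
                     seed: Optional[int] = None) -> List[str]:
    """Lottery: every distinct entrant has the same chance, whatever the arrival order."""
    # Drop duplicate entries (keeping first-seen order) so nobody gets extra tickets.
    entrants = list(dict.fromkeys(requests))
    rng = random.Random(seed)
    # Draw without replacement; never ask for more winners than entrants.
    return rng.sample(entrants, min(units, len(entrants)))


if __name__ == "__main__":
    arrivals = ["ana", "ben", "cho", "dev", "eli"]  # hypothetical entrants
    print(allocate_fifo(arrivals, 2))      # rewards whoever connected first
    print(allocate_lottery(arrivals, 2))   # rewards a random pair
```

With very limited inventory, the FIFO version favors whoever has the fastest connection, while the lottery version makes arrival time irrelevant, which is one plausible reason to pick the method per launch.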
A Little Bit About Where Nike Digital Was Six Years Ago
At this time, the culture was very much one of vendors, agencies, contractors, data centers. We were doing amazing business. We had transformed Nike, even at that time, from a very physical product-oriented company to the beginning of a digital offense.
The key to that, really, was asking how we wanted to transform our platform.
At the time, the platform was very monolithic, driven by contractors, with lots of big iron. We’d throw solution architects at it, and we’d have nike.com at the end of the day! Monoliths get a bad rap, but the monolithic system that we had at the time wasn’t really the problem. The problem was that we couldn’t scale our monolithic system and we couldn’t innovate on it. It was serving our business needs, it was driving a lot of revenue, but it wouldn’t take us to the next level.
How Do We Want to Transform Our Digital Footprint in the World?
The first thing we needed to do was come up with a set of aspirations around how we wanted to move forward.
That started with internalizing a lot of what our product owners were talking about, what they wanted, what they had, what they saw as the future for our consumers.
Being engineers, we tried to decompose that into the simplest statement possible, which was “premium experiences at scale.” That was our mantra. “Scale” had a lot of different meanings. It wasn’t just infrastructure scale. It wasn’t just scaling to our consumer. It was also scaling to the needs of innovating on that platform. It was scaling the number of internal experiences we had on that platform, etc. So that was a key insight at the time.
The Next Step to Move Forward Was That We Needed to Go Out and Hire Some Talent
Because we didn’t have a strong in-house engineering team, we needed to attract people to Nike, let them know that we’re not just a shoe company, we’re not just an apparel company, we’re serious about digital.
We started to do that, and we made some great hires in the early days. As a matter of fact, starting about four years ago, we were hiring an average of about 250 people into our technology group a year, which was pretty astounding for a shoe company.
Then We Started to Look Around the Industry for Luminaries
People we could learn from and ride on the success of. Some of the obvious ones that came to the front for us were Netflix, especially from a big infrastructure, high-scale standpoint, and then, Spotify. We tended to look to Spotify for that cultural touchstone, how they’ve built their teams, and how they operated internally.
Then We Talked About, “Okay, What Do We Want Our DNA to Be?”
Nike uses the term “DNA” a lot. What do we want, as we move into this era of developing software ourselves, how do we want to look at it? There was a lot.
We sat for days and days in rooms with the top 30 technologists in the company, which sounds like a small number, and it was back then. We had lots of bullet points, but these are some of the high-level talking points or aspirations that we wanted to have:
- We’re going to make a massive shift to in-house development.
- The power of the platform we wanted to create was that we could do small releases every day. Again, before this point at Nike, we were doing, at best, monthly releases and, oftentimes, it was more like quarterly releases. And we’d have maybe two big enterprise releases a year. We could not move at the speed of the business, so small releases every day.
- Open source first. We used open source back then, a little bit, but it wasn’t a concerted effort to always look at open source first before we started to develop our own products or before we started to buy things.
- Publish. If you guys have been to engineering.nike.com, you see that we have a really great presence right now in the open source community with a lot of great projects across different types of platforms, mobile, back-end and front-end.
- Then, consistent engineering principles across all of our different teams. These are things like security, maintainability, reliability, performance. We had seven of them, and we purposely put features at the end, so everything else needed to come before features.
- Product and engineering own the solution, two in a box. We have to own it together. We have to understand it. Our success is tied to it.
I’m Going to Talk About a Few Words and What They Meant at Nike
Agile: Again, four or five years ago at Nike, this was fairly disruptive. We had a very IT, enterprise-driven culture, lots of giant requirement documents, all of that. But we needed to embrace Agile.
For those of you who have started Agile in an enterprise environment, you know that, to the business, Agile often means, “Yeah, we can change anything any time we want and we don’t have to tell you when.” Maybe that isn’t how it works nuts and bolts, mechanically, but we did have to adapt to that environment. We had to adapt our processes, our ceremonies, and the way we worked to make that possible at Nike, regardless of how the business internalized the Agile mindset.
Pizza squads: Amazon, Spotify, and other companies we were looking at were talking about pizza squads, small teams, and so on, and we embraced that as well.
Sprints: This is a true story. Elements of the Nike business thought, “Oh my God, this is amazing, the engineers are embracing our legacy running culture, and they’re using terminology that means something to us.” They had no idea that this term was already in place, which was amazing.
The way we work in our day-to-day process is we do six one-week sprints that we put into milestones. The sixth week is generally supposed to be for things like innovation, stretch, a little bit of planning, etc. The reason we settled on that is that if you’re going to do one-week sprints, oftentimes the ceremonies around Scrum and Agile can overtake the amount of time you’re spending actually doing work. We wanted to accelerate that and make sure that we were putting the ceremonies in one place at the milestone level. We would do all of our planning at that point and then let people actually sprint forward. But we still had our stories and epics broken up into the sprint boundaries.
Cloud data. We wanted to get out of data centers. Data centers were killing us. We’re very seasonal. We couldn’t scale without actually adding a bunch of iron, and then we were paying for that iron the rest of the year, right, so it made sense to get into the cloud.
Automation. Clearly, a key to unlocking everything, and DevOps, in my mind, is nothing without automation.
Continuous delivery. At a big brand, that’s very valuable. To the business, though, this sounds like, “Oh my God, you’re going to push changes all the time, and I don’t get to tell you when, or look at them, or stop them,” so that was an interesting disruption for our business as well. But once the automation gets in place and we can prove quality through the pipeline, the business gets pretty excited about it. They want to have an idea one night, get up the next morning, and have it deployed by the end of the day, worst case.
Canary deploys. This is actually a really hard problem. It’s not so hard if you have one experience that you’re supporting with a platform, but it gets really difficult when you have many experiences on one platform, that platform is global, or you’re trying to follow the sun with that platform. I think it’s critical to the way we need to work in the future. I’ll tell you we haven’t cracked the problem yet, but we’re working hard on how to get there for the way we do business.
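For readers unfamiliar with the term, the core of a canary deploy can be sketched as sticky percentage routing plus a promote-or-rollback check. This is a generic illustration under stated assumptions, not the approach Nike uses: the function names, the 100-bucket scheme, and the `tolerance` threshold are all hypothetical.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so routing stays sticky."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the canary build."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


def next_step(canary_errors: int, canary_requests: int,
              stable_error_rate: float, tolerance: float = 0.01) -> str:
    """Decide whether to widen the canary, roll it back, or wait for data."""
    if canary_requests == 0:
        return "hold"  # no traffic has reached the canary yet
    canary_rate = canary_errors / canary_requests
    # Promote only if the canary is no worse than stable plus an assumed tolerance.
    if canary_rate <= stable_error_rate + tolerance:
        return "promote"
    return "rollback"
```

The difficulty described above shows up once many experiences share one global platform: a single `canary_percent` and a single error-rate comparison no longer say which experience or region a bad build is hurting.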
Decentralized quality. This one’s fun. Typically, a centralized quality organization gets to be the one that is responsible for the quality of your product. They don’t like that idea, right, but that’s what happens. It’s sort of that one throat to choke. When we said to our partners, our stakeholders, and people throughout the company, “Hey, we’re going to decentralize quality, we’re not going to have a QA anymore, the engineers are going to own quality, product’s going to own quality, we’re all going to own quality together,” they were like, “Well, who do we go to when the product is broken? We can’t really talk to the engineers because they’re super fragile and emotional. If we tell them that it’s their fault, they’re probably going to quit and they’ll stop doing their magic, right?” So that was a fun one, but we’re well, well through that, and I think it’s working really well. There are some pockets where we still have a bit of centralized quality, but it works more as a center of practice or a community of practice. Otherwise, we don’t have a central QA anymore; the engineers own quality.
Monitoring and alerting. Again, huge unlock for DevOps culture and just doing your business better.
DevOps. Before we talked about DevOps, a lot of this was very disruptive to a company like Nike in the way we did our digital business. When we started talking about monitoring and alerting really carefully, the engineers knew something was up. They were like, “Yeah, we do need monitoring and alerting, but do we really need it at that level?” They started to sense that something was going on.
Then when we said “DevOps,” they were like, “Timeout. What is going on here? Why can’t we just have a DevOps team? We already have a production support team. We could just call them DevOps, and it’ll be fine, right? We don’t have to do all that stuff, right?”
This was a very culturally disruptive idea, and that’s what I really want to highlight to you guys. But it’s not about technology, it’s not about the tools, it’s not even so much about the process. It’s more about this cultural accountability for the work that you do.
I try to think about it this way: we spend a lot of time decomposing our technical problems into pieces that we can solve with software. Part of this is decomposing our organization into the simplest autonomous units, which are the people who do the work, and giving them all the power that they need to do that work, but also giving them responsibility and accountability for how they do it and its quality.
But a lot of legitimate questions come back. It’s fair to say, “Why can’t we have a DevOps team?” The engineers are talking about, “Well, what does production support do now if we’re responsible for what’s in production? What happens when my services deploy to five regions around the world? Am I really on the hook for when it goes down in all of those regions and has consumer impact? How do I follow the sun? When are we going to get our features done? I’m spending a lot of time deploying infrastructure and managing it. When do we get our features done?”
Our answer back, and it was a cultural mechanism by which we wanted to affect this change, was, “Sorry, there’s not going to be a DevOps team. There are no DevOps roles. There are no people with DevOps in their title. DevOps is accountability, right?”
That’s just a forcing function. You have to pull the band-aid off and say, “That’s what it is.” There’s lots of stuff for production support to do. Production support can change into an organization that has a longer view of what’s going on with your infrastructure and how consumers use it. They can collaborate with you and use their experience to create the dashboards and the monitoring and alerting that they need. There’s a lot of value in that production support organization. It doesn’t go away just because engineers own the infrastructure they deploy to.
More importantly for me, is the idea of a DevOps team is really just the idea of putting another wall in place, where people can throw stuff over it and say, “It’s no longer my responsibility.” We have to squash that. It can’t be part of the way we work.
So Over Time
Just wanted to put up a few things that I think are still on our mind and things that we’ll really never stop working on.
If you saw the title of this, it’s “No Finish Line.” There’s no finish line. That’s a famous Nike statement at work, where we’re never done. There are finish lines for each little race, but there are lots of races, and so we internalize that as we’ve got to keep improving, we never stop, we just go after every little detail that we can.
These are some spaces that we’re working in:
- Better monitoring and alerting. I think what we’re looking for there is finer-grained monitoring and alerting. What happens a lot when you have a platform that’s used by many experiences is that when one of the experiences starts to seem to fail (even if it’s a platform issue), it’s obviously going to be visible in the experience. The first people to get paged are the experience engineers. Experience engineers get up, they spend an hour looking through what’s going on, and they’re like, “Wow, this has nothing to do with the experience. This is a problem with inventory or payment or whatever at the service level.” The question becomes, how do we get a monitoring and alerting system in place that carefully and accurately alerts the right people to a problem that’s happening? Because what happens otherwise is the interface or experience teams become the L1 support and they get really tired.
- Dependency management on a multi-experience platform. The next thing is, hey, we have a great platform. Can we all use it now? The fact with dependency management is you can no longer just deploy code and know that you’re only going to impact one experience. You’re impacting 15 experiences globally, so how do we make sure that we have the right test automation in place, that we understand the dependencies between all of the platform pieces?
- Tool consistency across squads. For me, that’s mostly, let’s stop developers working on commodity stuff. I don’t need another Jenkins. I don’t need another pipeline. I don’t need another test automation framework. They’re out there. There’s plenty to use. Let’s stop working on commodity. Let’s create features. That’s what our consumers want, and that’s what we need to do. It’s more about gluing that stuff together.
- Internal telemetry automation. For me, this is less about monitoring and alerting at the service level. It’s more about, on those Saturday shoe releases, are we scaled correctly? Did everything return to nominal after the previous launch? What is our readiness for that? It’s not just the readiness of our services and systems. It’s readiness for our operations, our content, all the things that go into a successful launch. It’s surfacing that view into the infrastructure and into the business overall operationally, and we need to spend more time doing that.
I think that’s just a quick view of where we’re going.
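The alert-routing problem described in the first bullet above, paging the team that owns the failing service rather than the experience team that merely sees the symptom, can be sketched as a lookup over ownership and dependency maps. Everything here is hypothetical: the service names, team names, and both maps are invented for illustration and do not describe Nike's systems.

```python
from typing import Dict, List

# Hypothetical ownership map: component name -> on-call team.
OWNERS: Dict[str, str] = {
    "sneakers-app": "experience-team",
    "inventory": "inventory-team",
    "payment": "payment-team",
}

# Hypothetical dependency map: component -> platform services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "sneakers-app": ["inventory", "payment"],
    "inventory": [],
    "payment": [],
}


def teams_to_page(failing: str, healthy: Dict[str, bool]) -> List[str]:
    """Page owners of unhealthy dependencies first; otherwise page the component's own team."""
    unhealthy = [
        dep for dep in DEPENDS_ON.get(failing, [])
        if not healthy.get(dep, True)  # a missing health signal is treated as healthy
    ]
    if unhealthy:
        return sorted({OWNERS[dep] for dep in unhealthy})
    return [OWNERS[failing]]
```

In this sketch, a failing experience whose inventory dependency reports unhealthy pages only the inventory team, so the experience engineers are not the default L1 support for platform issues.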
To Pull Back to the Story We Started With…
In terms of where we were, implementing all of the great things that these teams have been doing, what we did together to drive a cultural change, to organize the teams between product, design, engineering, how did that net out?
We ended up launching a new platform for what we call our sneakers platform, which is the sneakers website and then the sneakers iOS and Android apps. We had that running on the new platform and the old platform simultaneously for about six to nine months, where we were able to test our way into it. That culminated in Christmas of 2016 with the new Air Jordan 11 inspired by Space Jam. With this, we basically had the largest and most successful shoe launch in the company’s history. We went from waits of up to three hours down to minutes, and that is now the current platform that we build on.
If you go back to the principles and the hierarchy, we made it stable. We made it secure. We introduced the fairness with new ways of buying, and we satisfied the speed.
We were able to deliver on all that. It has become core to how we are moving forward, both in how we operate and in re-platforming the entire nike.com platform, and it has brought tremendous business value to the company.
But It's Not the End
Again, there is no finish line. During that time, we acquired a company out of New York that focused on community line services. It’s called Virgin Mega. They focused on this idea that people are standing in lines, whether it’s for ticketing, for concerts, at festivals, or for shoes. And during that time there’s a community and you have an opportunity to engage that community and inspire that community.
We’ve stood up a new digital studio in New York. We have remote teams working on this, connected back to our groups in Beaverton and Portland, working together on this platform at a global scale.
Published at DZone with permission of Gene Kim , DZone MVB. See the original article here.