How Expedia Uses NGINX Plus for Cloud Migration at Scale
This post is adapted from a presentation delivered at nginx.conf 2016 by Dave Drinkle of Expedia, Inc.
You can view a recording of the presentation on YouTube.
Dave Drinkle: My name’s Dave Drinkle. I’m a Senior Software Engineer with Expedia. I’ve been with Expedia for about six years, and for the last three years I’ve been working specifically on NGINX configuration for routing traffic through our front door. So what I want to do today is just walk through some of the tips and tricks that we’ve learned.
1:19 Three Pillars of Cloud Migration
Today, I want to talk about the three pillars that we’ve built our cloud migration on.
The first is multi‑region resiliency. This is how we build cross‑regional failover into our NGINX configurations so that if something goes down in one region, we auto‑fail over to another one. We’ll talk about how we do that. With NGINX it’s pretty straightforward to do all of these things, so I just wanted to bring them to light.
The second thing I want to touch on is avoiding the knife edge. At Expedia, we really try to focus on making slow changes. If we’ve put a new app or microservice out there in the cloud, we want to do that in a slow, controlled manner, and we want to have a way to be able to roll that back as quickly as possible if needed, as well.
The last thing I want to talk about is reacting to errors: how we can set up our proxy to react to the errors that are coming back from our microservices or apps.
Before I get into this, I want to give you guys a configuration that’s actually pretty functional. I can’t touch on everything because of time, but before I dive into those three pillars, I want to give us a starting point.
3:00 Traffic Through NGINX Before Cloud Migration
This is basically what Expedia’s traffic routing looked like before we really went to the cloud.
Traffic would come in through the browser, it would hit our CDN and then the traffic would go to our data centers. Pretty straightforward.
The first step we have to do when we’re moving to the cloud is getting our NGINX cluster out there and our traffic through it. We want to put NGINX in there as a man in the middle, but we still want to route that traffic back to the data center.
3:23 Traffic Through NGINX After Cloud Migration
This is where we’re going. The CDN is going to break that traffic up, route it into our multiple regions. But instead of the traffic going into our microservices, we’re going to route it all the way back to the data center.
This is where I want to start, and it gets us into our basic configuration.
3:43 Basic Configuration
You’re also going to need your access logs, and your error logs, and all your proxy parameters and all that kind of stuff, but this is the basic configuration for getting your data center set up.
We’ve set it up with two data centers that are weighted 70/30. We’ve got fail_timeout set up, and we’ve added the resolve parameter [on the server directives], which is an NGINX Plus‑only feature. It ensures that the DNS names for our data centers get resolved based on the resolver configuration that we have.
The server configuration is pretty straightforward. There’s nothing too complex going on here. That location block there is going to take all of the traffic that doesn’t have another route defined, and we’re going to route it to our data center with that proxy_pass.
A good practice is to always set the proxy_set_header for the [Host header].
So this isn’t too complex, it’s just a base configuration that we’re going to build on as we go.
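As a rough sketch of the base configuration described above (the slide itself is not reproduced here), something like the following would fit. The hostnames, the resolver address, the zone size, and the fail_timeout value are placeholders assumed for illustration, not Expedia’s actual settings.

```nginx
# Goes inside the http context.
# Resolver used by the "resolve" parameter below (address is a placeholder).
resolver 10.0.0.2 valid=30s;

upstream datacenter {
    # A shared memory zone is required when servers use "resolve".
    zone datacenter 64k;

    # Two data centers weighted 70/30 (hostnames are hypothetical).
    server dc1.example.com:80 weight=70 fail_timeout=10s resolve;
    server dc2.example.com:80 weight=30 fail_timeout=10s resolve;
}

server {
    listen 80;

    # Catch-all: anything without a more specific route goes to the data center.
    location / {
        proxy_set_header Host $host;
        proxy_pass http://datacenter;
    }
}
```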
4:58 Multi-Region Resiliency
Why do we need multi-region resiliency?
At Expedia, we focus on two main things. First is fault tolerance. We want to be sure that our customers are always getting a response if at all possible. That may mean we need to route traffic from one region to another.
We also want to reduce latency. If your microservice can be built so that it doesn’t have to phone home to the data center, then you can get reduced latency, by having your microservices deployed to as many regions as you can. And that means putting your NGINX in place there.
For the actual resiliency piece, we’ll use NGINX Plus health checks for the auto‑failover.
5:45 Multi-Region Resiliency In Pictures
Here’s a diagram to illustrate that visually. You can see all our traffic is coming in from our CDN. It’s going into our regional NGINX clusters. [Traffic coming into] the NGINX cluster in Region 1 is getting routed to the app in Region 1, and the NGINX clusters in Region 2 are routing it to the app in Region 2.
But what happens if we have a problem with that app?
What we want to get to is this diagram. If the app in Region 1 fails, the NGINX cluster in Region 1 is going to stop sending traffic to the app in Region 1 and route it over to Region 2.
Now, there’s some networking layer stuff here we have to make sure we have in place. You’ve got to make sure that you can actually talk from Region 1 to Region 2.
6:50 Routing to Your App
The configuration we will show as we go along will be the app configuration, and it’s very similar to what I just showed you with the data center.
We have our [upstream block called] app_upstream. We have one primary server, the top one there, app.region1. It’s set up with the resolve directives. That’s going to take all of our traffic.
Then the second server is just a backup. It’s set up to go over to our Region 2. The key there is, if we get hard failures with 500s or that kind of thing from the Region 1 server, then NGINX will fail over. That’s what the max_fails is all about.
And then we’re also going to set up our health checks at the bottom of this config. So, the next section there is the match, which tells NGINX what we consider to be a valid response from a health check. We’re not really too concerned about what’s in the response. We’re just saying that if it’s got a status between [200 and] 399, we consider it a valid response for the health check.
In the configuration for our actual application, we’re going to say that if you make a request on /myapp, that’s going to go to our application in the cloud. That’s what this proxy_pass line is all about.
You’ll notice here that I’ve actually split out the health check. There are a couple of reasons I’ve done this. One, in the next couple of slides, we’re going to mess with this upstream, so I don’t want to have my health checks be tied directly to this particular [location block].
The other reason to do this is when you have multiple URL paths that you want to route to a single microservice. This way we can have multiple paths and multiple location blocks within our configuration, and we don’t have to have the health‑check configuration multiple times within there.
So this works well. We just put health checks, and then whatever the app is that we’re health checking and we configure it that way.
The reason this works is because when NGINX does health checks, it checks the upstream itself. So if you have two location blocks that are both using the same upstream, [then] as long as the health check fails on one of them, it will fail on everything related to that upstream. That’s why we can break it out and it will still work fine for everything.
The health‑check logic is pretty straightforward. We’re saying that we want to match on the criteria from is_working. If it fails twice, we’re going to consider it bad. It’s going to check every 15 seconds, so it has to fail twice within 30 seconds. We want to slow that down for positives, so we have our passes take a little bit longer, and then we have the [uri].
There’s lots of other configuration you can do within the health check itself, but for now (for just the simplicity of the configuration here), I’m just showing you a bare‑bones config. So, this will get you going for sure, and you can kind of go from there.
That’s our bare‑bones config for multi‑region resilience. It’s very simple, but it really does give us an auto‑failover from one region to the other.
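Pulling the pieces of this section together, a minimal sketch might look like the one below. The hostnames, the health‑check path /health, the passes=4 value, the max_fails and fail_timeout values, and the name of the health‑check location are all assumptions made for illustration. It also assumes a resolver is already defined in the http context.

```nginx
# Goes inside the http context; requires NGINX Plus for health_check and match.
upstream app_upstream {
    zone app_upstream 64k;

    # Primary: takes all traffic while it is healthy.
    server app.region1.example.com:80 max_fails=2 fail_timeout=10s resolve;

    # Backup: only used when the primary is considered unavailable.
    server app.region2.example.com:80 resolve backup;
}

# What counts as a passing health check: any status from 200 through 399.
match is_working {
    status 200-399;
}

server {
    listen 80;

    location /myapp {
        proxy_set_header Host $host;
        proxy_pass http://app_upstream;
    }

    # Health check kept in its own location (name is hypothetical) so that
    # several /myapp-style locations can share one check of the upstream.
    location = /_health_check_app {
        internal;
        proxy_set_header Host $host;
        proxy_pass http://app_upstream;

        # Check every 15 seconds; 2 failures mark the server bad,
        # 4 passes (assumed value) are needed to mark it good again.
        health_check uri=/health match=is_working fails=2 interval=15 passes=4;
    }
}
```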
Let’s talk about the knife edge. One of the things about this configuration is that if we just put this in here right now, every request for my app would immediately start going to our application’s upstream. It wouldn’t be split between the data centers. Let’s talk about how to deal with that.
Why do we want to remove the knife edge? We want traffic to be moved from one origin to another in a controlled manner. When you’re working at scale, if we can send just 10% of that traffic to a new microservice and then start slowly, methodically ramping that up to 100%, we’re going to be much better off.
If you’re taking the kind of traffic that Expedia takes, you don’t want to break everything, even if it’s just for a few seconds. You really want to try and do this as slowly as possible without being too slow.
Obviously, this method is only useful for URL patterns that are currently taking traffic. If you’ve got a brand‑new URL pattern that’s going to go to a brand‑new microservice, you would not use this configuration.
Here is a diagram of what we’re going to do. Traffic is going to come in from the CDN. It’s going to hit that NGINX cluster, and the NGINX cluster’s going to split that however we need to between your app in the cloud and your data center.
11:51 Setting Up the Cookie
Let’s talk about the cookie first. It’s a pretty basic configuration.
We turn on the userid cookie. We set what the name of the cookie is going to be. We set the path. We set the expiry pretty high at 365 days because we don’t need it to change very often.
There’s one big caveat with this, and that is that the userid cookie is going to generate an ID, but it’s not guaranteed unique. It’s not a GUID. It’s close to a GUID and it’ll be fairly unique, but you will get duplicates.
If it’s really important for you to have a perfect split and everybody get a unique cookie, you can’t use the User ID plug‑in. You’d have to come up with some other mechanism to generate a cookie prior to actually hitting the proxy or before NGINX actually starts processing the request. This could be done with Lua, but that’s beyond the scope of this presentation.
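For reference, the cookie setup described in this section can be sketched with the standard userid directives. The cookie name ourbucket is taken from the $cookie_ourbucket variable mentioned later in the talk; the path value is an assumption.

```nginx
# Goes in the http, server, or location context.
userid         on;          # issue the cookie and read it when sent back
userid_name    ourbucket;   # cookie name, readable later as $cookie_ourbucket
userid_path    /;           # assumed: valid for the whole site
userid_expires 365d;        # long-lived, since the bucket rarely needs to change
```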
13:00 Split_Clients Config
The split_clients configuration is also really simple. The split_clients kind of works like a map (if you’ve ever used maps within NGINX), where we’re going to inspect the $cookie_ourbucket, and put that value through a hashing algorithm. The algorithm is MurmurHash2, which takes that key or that value and generates a big number, and from that number, generates a percentage.
If you fall into a certain percentage – say we get 9% – we would go to our app_upstream upstream group. If we’re anything above 10%, we’re going to go to our data center. What happens then is the value of app_upstream or value of datacenter gets applied to the variable $app_throttle.
There are three things I want to mention about this. You cannot use variables within the value of your split_clients. You can use upstreams, but you just can’t use variables. The variable will get interpreted as a string and you’ll get a string with your variable name.
Also, zero is not a valid option. So, you cannot put a zero as a percentage. If you do have to take your percentage back down to 0%, you should either comment the line out or just remove the line.
The other issue is with $cookie_ourbucket. The very first time a user hits your proxy and generates the cookie, that’s when the User ID plug‑in is generating that cookie – which is perfect, it works great, except that when the cookie is generated, it does not get applied to the variable $cookie_ourbucket.
So, the very first time a customer comes into your proxy without the cookie and it’s generated, this variable will be empty. Then your hashing algorithm hashes every one of those people exactly the same way, with the exact [same] percentage and you end up with a pretty bad split.
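A sketch of this first version of the split, assuming the 10% figure used in the explanation and the upstream names app_upstream and datacenter from earlier in the talk:

```nginx
# Goes inside the http context.
# Hash the cookie value and bucket the request by percentage.
# Caveat from the talk: on a visitor's first request the cookie is only being
# set in the response, so $cookie_ourbucket is empty and every first-time
# visitor hashes to the same bucket.
split_clients "${cookie_ourbucket}" $app_throttle {
    10%   app_upstream;   # literal upstream name; variables are not expanded here
    *     datacenter;     # everything else
}
```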
15:10 Modified Split_Clients Config
So, what we came up with was a slight modification to this configuration. It uses two variables that are available from the User ID plug‑in. It’s a very small change. Instead of inspecting $cookie_ourbucket, we’re going to inspect two variables: $uid_set and $uid_got.
These two variables are mutually exclusive: only one will ever be set at a time.
uid_set will be set when the User ID plug‑in sets the cookie, and it’ll be blank when the cookie has been sent in from the browser.
uid_got will have the same value when the request has come in from the browser with the cookie, and it’ll be blank whenever the User ID plug‑in is setting it.
Effectively, what we’ve done here is we’ve used two variables. We know that one of them’s always going to be blank, so we end up getting the same result from both of them. This way, even for your very first request that comes into your proxy, you’re still going to get a nice split.
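The modified split can be sketched like this, again assuming a 10% share for the cloud app:

```nginx
# Goes inside the http context.
# Exactly one of $uid_set / $uid_got is non-empty on any request, so the
# concatenation gives a usable hash key even on a visitor's first request.
split_clients "${uid_set}${uid_got}" $app_throttle {
    10%   app_upstream;
    *     datacenter;
}
```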
16:28 Traffic Routing with Split_Client Values
The last bit of this configuration is really simple and is just setting the upstream in the /myapp location block. All we’re going to do is instead of using app_upstream like before, we’re going to use $app_throttle, which we set previously with that split_clients block.
Really straightforward, and what we end up getting is a nice split that we can control with our code just by switching the percentages.
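In config terms, the change amounts to something like the following sketch. It assumes $app_throttle is set by a split_clients block and that the app_upstream and datacenter upstream groups are defined elsewhere in the http context.

```nginx
# Goes inside a server block.
location /myapp {
    proxy_set_header Host $host;

    # $app_throttle evaluates to either "app_upstream" or "datacenter";
    # NGINX matches that name against the defined upstream groups.
    proxy_pass http://$app_throttle;
}
```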
What we like about this kind of configuration is how easy it is to roll back. You may have a team that ramps up traffic to a new app and it looks good. Everybody’s happy, everybody goes home, and then over the next couple of days you’re actually making more configuration changes for other different microservices teams or whatever. Then the original app team comes back and says, “Hey, we need to roll back. We have a problem that we didn’t recognize on deployment night.”
Now, instead of having to actually roll back the code, you can roll back by just taking your app_upstream back down to zero percent (by commenting out or removing that line, since 0% is not a valid value), and all of a sudden you’re back to the data center for all the traffic.
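Because 0% is not accepted as a percentage, a rollback in practice might look like this sketch, where the app_upstream line is commented out and only the catch‑all remains:

```nginx
# Goes inside the http context.
split_clients "${uid_set}${uid_got}" $app_throttle {
    # 10%   app_upstream;   # commented out to roll back; "0%" is not valid
    *       datacenter;     # all traffic returns to the data center
}
```

A configuration reload then applies the change without redeploying the app itself.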
18:07 Reacting to Application Errors
The last thing I want to talk about is reacting to application errors. I’m going to talk about two different types of errors: hard errors and soft errors.
Hard errors are pretty obvious: these are your 500‑level errors that are clearly application misconfigurations or failures. Soft errors, on the other hand, are those application errors that require the request to be reprocessed.
In HTTP land, you’ll generally do soft errors with a redirect – a 307, something like that. I’m going to also show you some other ways we can do that which are actually pretty interesting.
18:54 Reacting to Hard Errors
Let’s talk about hard errors first. Why do we want our proxy to look after hard errors?
The first reason is: we get a nice, unified error page. It’s important that customers always get a nice error page. If you’ve got a lot of microservices out there in the cloud, you don’t want to have to have all of your microservices have some code in them that deals with displaying an error in a nice way. That gets even more complicated when you’re dealing with multilingual sites and instead of having a very simple error page, you have to have 20 different error pages because you have 20 different languages to support.
A couple things are really nice about this. Our apps can go back to a process of just responding with a 400 or a 500 error code when it truly is a 500 error. We can get it logged properly in the application. We can get it logged properly within our proxy, and still make sure that we’re sending a nice error page back to the customer.
The other thing is, with 500 errors, because we know that the proxy is going to be responding with a nice error page, we can let our app teams send stack traces out on those errors since we know that those are not going to propagate out to the customer.
The last benefit is the application developers just don’t have to worry about errors. They get to do the things that they are paid to do. They don’t have to worry about the errors, and making sure that error pages look right.
So that’s why we handle hard errors at the proxy level.
20:54 Reacting to Soft Errors
In terms of soft errors, I’m talking about requests that the app may not be able to handle. You don’t want those requests to go to the app, so you added a whole bunch of lines to your NGINX config to do that. For instance, if a particular query string is matched, you don’t send the requests to the cloud, you send it back to your data center, because you know your data center can handle it.
That may work in the short-term, but it creates this tight coupling between NGINX and your app or microservice, and we all know we should probably be trying to architect for fairly decoupled systems.
An example we had at Expedia was that we needed to move our hotel‑search page from an old URL pattern, with its old query‑string format, to a new URL pattern that used a completely different set of query strings.
What we ended up doing was building a microservice that could just do that translation. But hotel searches are actually quite complicated for various things, so there were certain features of the hotel search that we couldn’t easily translate. So what we did was we just said, “All of the traffic for the old pattern is going to go to this microservice, and when the microservice itself can’t do the response, we’ll have the microservice send an error back to the customer, and we’ll do that with a 302 or a 307.” That’s how we originally implemented this. It was handled with the redirect.
So normally what you do is add a query‑string parameter like nocloud or noapp or similar, and then key off of that within your NGINX config.
When NGINX sees that query‑string parameter, you can just short‑circuit all of your routing and route it all the way back to the data center. So, that’s one approach.
Another reason to use soft errors is that maybe as you’re migrating your applications to the cloud, you want to get things up quick. You might want to build 80% of the features for now, and the last 20% you’re going to build over time. This is another reason you could use this soft‑error approach.
This is what it would look like if we handle soft errors with a [redirect].
Your request would come in for /myapp. The NGINX proxy is going to route it through your cloud. The cloud app, for some reason, can’t handle the request so it responds with a 302 and the location in that 302 header is the same request but with a question mark and noapp=1.
That goes all the way back to the browser, the browser re‑requests the new page. The NGINX proxy grabs it and says, “Oh, that’s got a noapp=1 on it, so I’m going to route it to the data center.”
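The talk does not show how the proxy keys off the query‑string parameter, so the following is only one possible sketch, not Expedia’s configuration. It assumes the parameter is noapp=1, that $app_throttle comes from a split_clients block, and it introduces a hypothetical variable named $myapp_upstream.

```nginx
# Goes inside the http context.
# $arg_noapp holds the value of the "noapp" query-string argument.
map $arg_noapp $myapp_upstream {
    default   $app_throttle;   # normal case: follow the percentage split
    1         datacenter;      # noapp=1: short-circuit to the data center
}

server {
    listen 80;

    location /myapp {
        proxy_set_header Host $host;
        proxy_pass http://$myapp_upstream;
    }
}
```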
That’s one approach to it, but there’s actually a better one.
24:11 Soft Errors – A Better Approach
We can react to soft errors within the proxy itself.
What happens is the browser is still going to make that initial request which will go to the app, but instead of responding with a 302, the app will respond with a special error code. At Expedia, we use a nonstandard HTTP error code, which I’ll show you later.
Then, the proxy is going to use its error‑handling system to repeat the exact same request, but route it to the data center. The customer gets the same result, but we skip the step of the browser having to do all of the 307 work. If you’re dealing with CDNs, especially if you’re across the world, you can reduce significant latency this way, because the requests don’t have to go all the way back to the browser.
This is what the configuration for that approach looks like.
What we do is we’re going to use NGINX’s error handling. We set an error_page for a 352, which is our nonstandard error code. When we get a 352, we’re going to send it to our location named @352retry. And then on a 404, we’re going to send it to our @404fallback location.
We have the proxy_intercept_errors turned on, which is crucial. Without proxy_intercept_errors, none of this works – all of this error handling just doesn’t happen.
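A minimal sketch of the /myapp location with this error handling wired in is shown below. It assumes $app_throttle is set by a split_clients block and that the two named locations are defined in the same server block.

```nginx
# Goes inside a server block.
location /myapp {
    # Hand these upstream statuses to named locations instead of the client.
    error_page 352 = @352retry;      # nonstandard "soft error" code from the app
    error_page 404 = @404fallback;

    # Without this, error_page is not applied to responses from the upstream.
    proxy_intercept_errors on;

    proxy_set_header Host $host;
    proxy_pass http://$app_throttle;
}
```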
The location blocks are pretty straightforward as well. So the reason I have $original_error_code is to help log the original error. When NGINX makes the secondary request for your error handler, it’s going to respond, and the response that you get from the error handler is actually going to be what’s logged. So, if you don’t log your original error code, you’re going to lose what was going on with the original request.
There’s lots of other things you could capture here as well. If there’s some headers or something else that you wanted to grab, you can also log those.
The proxy_pass line is pretty straightforward. We’re going to route to our datacenter upstream, and then we’re going to use the $uri variable, which is just the original request URI. We’re going to use $is_args, which will give us a question mark if there were arguments, and it’ll be blank if there are not arguments. And then we’ll use $args for the actual arguments.
The @404fallback location block works really similarly. We’re going to log the 404, and then we’re going to route to the 404ErrorPage.
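The two handlers described above could be sketched as follows. The log format name, the log path, the /404ErrorPage path, and the helper variable $error_page_uri are assumptions for illustration. The error‑page path goes through a variable because NGINX does not accept a literal URI part on proxy_pass inside a named location.

```nginx
# Goes inside the http context.
# Hypothetical log format that records the status the app originally returned.
log_format with_original '$remote_addr "$request" $status '
                         'original=$original_error_code';

server {
    listen 80;
    access_log /var/log/nginx/access.log with_original;

    # Default value so the variable is defined for requests that never
    # reach one of the error handlers below.
    set $original_error_code "-";

    # Soft error: replay the same request against the data center.
    location @352retry {
        set $original_error_code 352;
        proxy_set_header Host $host;
        proxy_pass http://datacenter$uri$is_args$args;
    }

    # Hard 404: serve the unified error page from the data center.
    location @404fallback {
        set $original_error_code 404;
        set $error_page_uri /404ErrorPage;   # assumed path
        proxy_set_header Host $host;
        proxy_pass http://datacenter$error_page_uri;
    }
}
```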
So that wraps up the three big steps, the three pillars we have for moving to the cloud.
We really want that multi‑region resiliency to make sure that when something goes wrong with an app, we auto‑route that traffic to a different region and get a latent response, rather than a bad response.
The split_clients config that we did to avoid that knife edge is so important when you’re dealing with microservices. Especially in situations where you have significant amounts of existing traffic, it’s so important that you don’t just flip all that traffic over to a new service at once. We’ve seen it countless times where if you just flip it over, you’re going to break something, and you just want to avoid that.
And then this whole idea of soft and hard errors really is just using the NGINX config to its potential.
Published at DZone with permission of Floyd Smith. See the original article here.