Upgrading backend servers in a production environment can be a challenge for your operations or DevOps team, whether they are dealing with an individual server or upgrading an application by moving to a new set of servers. Putting upstream servers behind NGINX Plus can make the upgrade process much more manageable while also eliminating or greatly lessening downtime.
In a three-part series of articles, we’ll focus on NGINX Plus – with a number of features above and beyond those in the open source NGINX software, it’s a more comprehensive and controllable solution for upgrades with zero downtime. This first article describes the two NGINX Plus features you can use for backend upgrades – the on-the-fly reconfiguration API and health checks – in detail and compares them to upgrading with the open source NGINX software.
The two related articles explain how to use these methods for two classes of upgrades.
Choosing an Upgrade Method in NGINX Plus
NGINX Plus provides two methods for dynamically upgrading production servers and application versions:
- On-the-fly reconfiguration API – Use an HTTP-based API to send HTTP requests to NGINX Plus that add, remove, or modify the servers in an upstream group.
- Application-aware health checks – Define health checks so that you can purposely fail servers you want to take out of the load balancing rotation, and make them pass the health check when they are again ready to receive traffic.
The two methods differ with respect to several factors, so the choice between them depends on your priorities:
- Speed of change – With the API, the change takes effect immediately. With health checks, the change doesn’t take effect until a health check fails (the default frequency of health checks is 5 seconds).
- Initial traffic volume – With health checks, you can configure slow start: when a server returns to service, NGINX Plus slowly ramps up the load to the server over a defined period, allowing applications to “warm up” (populate caches, run just-in-time compilations, establish database connections, and so on). The server is not overwhelmed by connections, which might time out and cause it to be marked as failed again. With the API, NGINX Plus immediately sends a server its full share of traffic.
- Automation and scripting – With the API, you can automate and script most phases of the upgrade, and do so within the NGINX Plus configuration. To automate upgrades when using health checks, you must also create scripts that run on the servers being upgraded (for example, to manipulate the file used for semaphore health checks).
In general, we recommend the NGINX Plus on-the-fly reconfiguration API for most use cases because changes take effect immediately and the API is fully scriptable and automatable.
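To make the slow-start trade-off more concrete, here is a minimal sketch of how a returning server's share of traffic might ramp up. It assumes a linear ramp from zero to the server's nominal weight over the slow-start period. The function name `effective_weight` and the linear shape are illustrative assumptions, not a description of NGINX Plus internals.

```python
def effective_weight(nominal_weight: float, elapsed_s: float, slow_start_s: float) -> float:
    """Weight a recovering server carries `elapsed_s` seconds after returning.

    Assumption: the weight grows linearly from 0 to `nominal_weight` over
    `slow_start_s` seconds. A slow-start period of 0 means no ramp at all,
    so the server gets its full weight immediately.
    """
    if nominal_weight < 0 or elapsed_s < 0 or slow_start_s < 0:
        raise ValueError("arguments must be non-negative")
    if slow_start_s == 0 or elapsed_s >= slow_start_s:
        return nominal_weight
    return nominal_weight * elapsed_s / slow_start_s


if __name__ == "__main__":
    # Hypothetical server with weight 10 and a 30-second slow start.
    for t in (0, 10, 20, 30):
        print(t, effective_weight(10, t, 30))
```

With no ramp (a slow-start period of 0), the sketch returns the full weight at once, which is the situation described above for a server brought back with the API.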
Upgrading With Open Source NGINX
First, let’s review how upgrades work with the open source NGINX software, and explore some possible issues. Here you change upstream server groups by editing the upstream configuration block and reloading the configuration file. The configuration reload is seamless because a new set of worker processes is started to use the new configuration, while the existing worker processes continue to run and handle connections that were open when the reload occurred. Each old worker process terminates as soon as all its connections have completed. This design guarantees that no connections or requests are lost during the reload, and makes the reload method suitable even when upgrading NGINX itself from one version to another.
Depending on the nature of the outstanding connections, the time it takes to complete them all can range from just seconds to several minutes. If the configuration doesn’t change often, running two sets of workers for a short time usually has no bad effects. However, if changes (and consequently reloads) are very frequent, old workers might not finish processing requests and terminate before the next reload takes place, leaving multiple sets of workers running at once. With enough workers, you might eventually end up exhausting memory and hitting 100% CPU, particularly if you’re already optimizing use of resources by running your servers at close to capacity.
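A rough back-of-the-envelope calculation shows how quickly worker sets can pile up. The sketch below assumes a worst case in which every superseded worker set needs the full drain time to finish its connections, and in which reloads arrive at a fixed interval. The function name `peak_worker_sets` and both parameters are hypothetical.

```python
import math


def peak_worker_sets(drain_seconds: float, reload_interval_seconds: float) -> int:
    """Estimate how many worker sets run at once just after a reload.

    Assumptions: reloads happen every `reload_interval_seconds`, and each
    old worker set lingers for `drain_seconds` after it is superseded.
    """
    if reload_interval_seconds <= 0:
        raise ValueError("reload interval must be positive")
    if drain_seconds < 0:
        raise ValueError("drain time must be non-negative")
    # Old sets still draining, plus the one set serving the new configuration.
    old_sets = math.ceil(drain_seconds / reload_interval_seconds)
    return 1 + old_sets


if __name__ == "__main__":
    # Connections take up to 3 minutes to finish, reloads happen every minute.
    print(peak_worker_sets(180, 60))
```

Multiply the result by the number of worker processes per set to estimate the total process count under these assumptions.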
When you’re load balancing application servers, upstream groups are the part of the configuration that changes most frequently, whether it’s to scale capacity up and down, upgrade to a new version, or take servers offline for maintenance. Customers running hundreds of virtual servers load balancing traffic across thousands of backend servers might need to modify upstream groups very frequently. Using the reconfiguration API or health checks in NGINX Plus, you avoid the problem of frequent configuration reloads.
Overview of the NGINX Plus Upgrade Methods
The use cases discussed in the two related articles use one of the following methods, sometimes in combination with auxiliary actions.
Upgrading With the On-the-Fly Reconfiguration API
To use the on-the-fly reconfiguration API to manage the servers in an upstream group, you issue HTTP requests that all start with the following URL string. We’re using the conventional location name for the API, /upstream_conf, but you can configure a different name (see the section about the base configuration in the second or third article).
http://localhost:8080/upstream_conf?upstream=demoapp
When you issue this request with no additional parameters, a list of the servers and their ID numbers is returned, as in this example for the use cases we’ll cover in the other two articles:
server 172.16.210.81:80; # id=0
server 172.16.211.82:80; # id=1
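Because later requests refer to servers by ID, a script usually needs to turn that listing into a lookup table. The following is a minimal sketch that parses listing text in the format shown above. The function name `parse_server_list` is hypothetical, and the sketch assumes that any extra server parameters appear before the semicolon, where they are ignored.

```python
import re
from typing import Dict

# Matches entries such as "server 172.16.210.81:80; # id=0".
# Assumption: anything between the address and the semicolon is a parameter.
_SERVER_ENTRY = re.compile(
    r"server\s+(?P<address>[^\s;]+)[^;]*;\s*#\s*id=(?P<id>\d+)"
)


def parse_server_list(listing: str) -> Dict[int, str]:
    """Return a mapping of server ID to address for each entry in the listing."""
    return {
        int(match.group("id")): match.group("address")
        for match in _SERVER_ENTRY.finditer(listing)
    }


if __name__ == "__main__":
    sample = (
        "server 172.16.210.81:80; # id=0\n"
        "server 172.16.211.82:80; # id=1\n"
    )
    print(parse_server_list(sample))
```

In practice the listing text would come from an HTTP GET on the base URL, for example with `urllib.request.urlopen`.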
To make changes to the servers in the upstream group, append other strings to the base URL as indicated:
- Add a server – Append this string:
By default, the server is marked up and NGINX Plus starts sending traffic to it immediately. To mark it down so that it does not receive traffic until you are ready to mark it as up, append the down parameter as well.
- Remove a server – NGINX Plus terminates all connections immediately and sends no more requests to the server. Append this string:
- Mark a server as down – NGINX Plus stops opening new connections to the server, but any existing connections are allowed to complete. Using the NGINX Plus live activity monitoring dashboard or API, you can see when the server no longer has any open connections and can be safely taken offline.
- Mark a server as drain(ing) – NGINX Plus stops sending traffic from new clients to the server, but allows clients who have a persistent session with the server to continue opening connections and sending requests to it. Once you feel that you have allowed enough time for sessions to complete, you can mark the server as down and take it offline. For a discussion of ways to automate the check for completed sessions, see Using the API with Session Persistence for an Individual Server Upgrade.
- Mark a server as up – NGINX Plus immediately starts sending traffic to it.
- Change server configuration – You can set any of the parameters on the server directive. We’ll use this feature to set server weights in several of the use cases.
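The actions above lend themselves to scripting. The sketch below builds the request URL for each action. The base URL and upstream name come from the example earlier in this article. The query parameter names (add, server, remove, id, up, down, drain, weight) are assumptions based on our understanding of the upstream_conf module, and the helper names are hypothetical. Check the names against the documentation for your NGINX Plus release before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/upstream_conf"  # location name used in this article
UPSTREAM = "demoapp"                               # upstream group used in this article


def api_url(**params: str) -> str:
    """Build a request URL; empty values produce flag-style arguments such as 'down='."""
    query = {"upstream": UPSTREAM, **params}
    return f"{BASE_URL}?{urlencode(query, safe=':')}"


def add_server(address: str, down: bool = False) -> str:
    """URL that adds a server, optionally marked down (assumed 'down=' flag)."""
    extra = {"down": ""} if down else {}
    return api_url(add="", server=address, **extra)


def remove_server(server_id: int) -> str:
    """URL that removes the server with the given ID."""
    return api_url(remove="", id=str(server_id))


def set_state(server_id: int, state: str) -> str:
    """URL that marks a server as up, down, or drain."""
    if state not in ("up", "down", "drain"):
        raise ValueError(f"unknown state: {state}")
    return api_url(id=str(server_id), **{state: ""})


def set_weight(server_id: int, weight: int) -> str:
    """URL that changes a server's weight."""
    return api_url(id=str(server_id), weight=str(weight))


if __name__ == "__main__":
    print(add_server("172.16.211.83:80", down=True))  # hypothetical new server
    print(set_state(0, "drain"))
    print(remove_server(1))
```

Each function only returns a URL string. Sending it, for example with `urllib.request.urlopen`, is left out so that the sketch can be inspected without a running NGINX Plus instance.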
Upgrading With Application Health Checks
Configuring application health checks is an easy way to improve the user experience at your site. By having NGINX Plus continually check whether backend servers are up and remove unavailable servers from the load-balancing rotation, you reduce the number of errors seen by clients. You can also use health checks to bring servers up and down, instead of (or in addition to) the API.
There are a few significant differences between health checks and the API. For example, when a server is marked down because it fails a health check, the server no longer receives new connections, even from clients that are pegged to it by a session persistence mechanism. (In other words, with health checks you can set server state to the equivalent of the API’s down, but not to drain.)
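One common way to fail a health check on purpose is the semaphore-file approach mentioned earlier: the backend's health endpoint reports success only while a particular file exists. The following is a minimal sketch of that logic as it might run on the backend server. The function names, the file name, and the choice of 200 and 503 as status codes are all illustrative assumptions.

```python
import tempfile
from pathlib import Path


def health_status(semaphore: Path) -> int:
    """Status code the health endpoint would return: 200 if the file exists, else 503."""
    return 200 if semaphore.is_file() else 503


def fail_health_check(semaphore: Path) -> None:
    """Remove the semaphore file so that subsequent health checks fail."""
    try:
        semaphore.unlink()
    except FileNotFoundError:
        pass  # already out of rotation


def pass_health_check(semaphore: Path) -> None:
    """Create the semaphore file so that subsequent health checks pass."""
    semaphore.touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sem = Path(tmp) / "healthy"  # hypothetical semaphore file
        pass_health_check(sem)
        print(health_status(sem))
        fail_health_check(sem)       # take the server out of rotation
        print(health_status(sem))
```

An upgrade script on the backend would call `fail_health_check` before maintenance and `pass_health_check` once the server is ready for traffic again.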
The two related articles apply these methods to two classes of upgrades:
- Upgrading hardware or software on an individual server machine
- Upgrading to a new version of an application by switching traffic to completely different servers or upstream groups