My first job as a software engineer was at Motorola, Inc. in Arlington Heights, IL. I worked on cellular base station software that did call state management across different radio protocols. For telephony software, five nines availability (i.e. less than 5.26 minutes of downtime per year) is a requirement. At Motorola, and in my subsequent years at Bell Labs, I learned about how highly-available systems are built and operated.
At that time (way back in the mid-90s), high availability required expensive and complex hardware where every component had an active spare. Today, things are much easier. It’s now possible to build high availability in software, with application components replicated across commodity hardware. And by using containers for deployment and operations, we also get fast recovery times: as low as a few milliseconds, with intelligent caching and sophisticated management tools like Kubernetes and Nirmata.
However, what remains the same is the need for designing proper systems management. This includes managing state updates across all components of the system. And with microservices-style applications, where an application can be composed of several services, and each service can have several instances, implementing reliable and scalable management has become even more complex, as there are several disparate components to track and manage! Let’s take a look at how we address this complex challenge in Nirmata:
To manage service availability, we first need to know three things:
- What components matter to the service
- Which states each component can have
- How we define and measure service availability
Below is a partial view of the domain entities (objects) in Nirmata. As you can see, Nirmata collects and manages information for both infrastructure-related objects (Cloud Providers, Hosts, Containers, etc.) and software application objects (Applications, Services, etc.).
This is important, as building a good state model requires knowledge of the dependencies and relationships across objects, as well as up-to-date information on each object’s state. For example, if a service’s availability is impacted because a Host is rebooting, this can now be correlated and reported.
This model also helps us separate critical issues from failures which are transient. For example, a single container exit may not impact service availability. In fact, containers are expected to exit as part of a rolling upgrade where an orchestration engine coordinates image upgrades across service instances.
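To make the correlation idea concrete, here is a minimal sketch in Python. It is not Nirmata's implementation: the `Container` record, the `impacted_services` helper, and the host and service names are all hypothetical, chosen only to show how knowing which container runs on which host lets us tell a service-impacting failure from one the service can absorb.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Container:
    """Hypothetical record linking a container to its host and its service."""
    name: str
    host: str
    service: str


def impacted_services(containers: List[Container], failed_host: str) -> Dict[str, str]:
    """For each service with containers on failed_host, report whether it
    still has instances on other hosts ("degraded") or none at all ("failed")."""
    result = {}
    affected = sorted({c.service for c in containers if c.host == failed_host})
    for service in affected:
        survivors = [
            c for c in containers
            if c.service == service and c.host != failed_host
        ]
        result[service] = "degraded" if survivors else "failed"
    return result


# Illustrative inventory: "web" is spread over two hosts, "db" is not.
inventory = [
    Container("web-1", host="host-a", service="web"),
    Container("web-2", host="host-b", service="web"),
    Container("db-1", host="host-a", service="db"),
]
report = impacted_services(inventory, failed_host="host-a")
```

In this example, losing `host-a` leaves `web` with one surviving instance but takes out `db` entirely, so only the second case would count as a service outage.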
Once we know what domain entities to track, the next step is to define and manage states and state transitions for each. In Nirmata, each object has two primary states and several secondary supporting states. The primary states are:
- Operational State: this state represents if and how the object is currently operating. For example, “up”, “down”, “failed” can be part of an object’s operational state model.
- Administrative State: this state is managed by the user or administrator. For example, “disabled” or “suspended” are states triggered by user actions.
Beyond the primary states, each object can have several secondary states. These states are specific to the object and provide more details on the primary state. For example, most managed objects in Nirmata have an “execution state” which indicates that a system operation is being performed.
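One way to picture the split between primary and secondary states is the sketch below. It is an illustration under assumptions, not Nirmata's data model: the class names, the `enabled` administrative state, the default values, and the `is_unexpected_outage` helper are all invented for the example. Only the state names “up”, “down”, “failed”, “disabled”, and “suspended” come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class OperationalState(Enum):
    UP = "up"
    DOWN = "down"
    FAILED = "failed"


class AdministrativeState(Enum):
    ENABLED = "enabled"      # assumed default; not named in the article
    DISABLED = "disabled"
    SUSPENDED = "suspended"


@dataclass
class ManagedObject:
    """Hypothetical managed object with two primary states and free-form
    secondary states keyed by name (for example "execution")."""
    name: str
    operational: OperationalState = OperationalState.DOWN
    administrative: AdministrativeState = AdministrativeState.ENABLED
    secondary: Dict[str, str] = field(default_factory=dict)

    def is_unexpected_outage(self) -> bool:
        """True when the object is not up although the user wants it enabled."""
        return (
            self.administrative is AdministrativeState.ENABLED
            and self.operational is not OperationalState.UP
        )


host = ManagedObject("host-1", operational=OperationalState.UP)
host.secondary["execution"] = "upgrading"  # illustrative secondary state
```

Keeping the two primary states separate means an object that a user has disabled on purpose can be told apart from one that went down on its own.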
Here are some of the operational states that a Service (within an application) can have in Nirmata:
- Running: all instances of the Service are healthy and running.
- Degraded: some instances of the Service have failed, or are executing.
- Executing: all instances of the Service are executing changes.
- Failed: all instances of the Service have failed.
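The four definitions above can be read as a rule for deriving a Service's state from the states of its instances. The sketch below encodes that reading in Python. The enum and function names are hypothetical, and two choices are assumptions of this sketch rather than anything stated above: a mix of failed and executing instances with none running is treated as Degraded, and a Service with no instances is rejected as an error.

```python
from enum import Enum
from typing import Iterable


class InstanceState(Enum):
    RUNNING = "running"
    EXECUTING = "executing"
    FAILED = "failed"


class ServiceState(Enum):
    RUNNING = "Running"
    DEGRADED = "Degraded"
    EXECUTING = "Executing"
    FAILED = "Failed"


def derive_service_state(instances: Iterable[InstanceState]) -> ServiceState:
    """Roll up instance states into a single Service operational state."""
    states = set(instances)
    if not states:
        # Assumption: a Service with no instances has no defined state here.
        raise ValueError("a service needs at least one instance")
    if states == {InstanceState.RUNNING}:
        return ServiceState.RUNNING
    if states == {InstanceState.FAILED}:
        return ServiceState.FAILED
    if states == {InstanceState.EXECUTING}:
        return ServiceState.EXECUTING
    # Any mix of states: some instances have failed or are executing.
    return ServiceState.DEGRADED


mixed = derive_service_state(
    [InstanceState.RUNNING, InstanceState.FAILED, InstanceState.RUNNING]
)
```

With this rule, a rolling upgrade that touches one instance at a time shows up as Degraded rather than Failed, which matches the idea that a single container exit need not impact the service.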
Now that we have the service’s state model, tracking availability becomes easier. In Nirmata, availability is calculated as the percentage of time the Service is “Running” or “Degraded”. The same approach is used for propagated states at the application level. In future releases, we have plans to allow users to customize how each service in an application is used to calculate overall availability.
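As a rough sketch of that calculation, the Python below computes availability from a history of state changes. The function name, the `(timestamp, state)` tuple format, and the sample history are assumptions made for illustration. The only part taken from the text is the rule itself: time spent in “Running” or “Degraded” counts as available.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

AVAILABLE_STATES = {"Running", "Degraded"}


def availability_percent(
    transitions: List[Tuple[datetime, str]], window_end: datetime
) -> float:
    """Percentage of the window spent in an available state.

    transitions must be sorted by time; the first entry marks the start of
    the measurement window and each state lasts until the next entry.
    """
    history = list(transitions)
    if not history:
        raise ValueError("at least one state transition is required")
    total = (window_end - history[0][0]).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after the first transition")

    available = 0.0
    # Pair each transition with the one that ends its interval.
    ends = [t for t, _ in history[1:]] + [window_end]
    for (start, state), end in zip(history, ends):
        if state in AVAILABLE_STATES:
            available += (end - start).total_seconds()
    return 100.0 * available / total


# Illustrative one-hour window with a five-minute outage.
t0 = datetime(2024, 1, 1, 12, 0)
history = [
    (t0, "Running"),
    (t0 + timedelta(minutes=40), "Degraded"),
    (t0 + timedelta(minutes=50), "Failed"),
    (t0 + timedelta(minutes=55), "Running"),
]
pct = availability_percent(history, window_end=t0 + timedelta(minutes=60))
```

Here the service is available for 55 of the 60 minutes, so `pct` comes out at roughly 91.7. Note that time spent in “Executing” does not count as available under this rule.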
Below we see the environments view in Nirmata. Each environment can have several applications, each with its own availability, but we also see the overall availability for the environment:
Drilling down, we can easily see the availability for a particular application. Nirmata even shows the relevant state changes, and identifies the failure reason.
Here we see the individual service view and state changes for it:
Managing service availability for containerized microservices-style applications can be complex, as there are several things to measure and track. And measuring availability requires a well-defined state model. The object states need to be managed, correlated, and propagated across infrastructure and software entities.
With the right systems management foundation, it’s possible to measure, track, and report availability for a service, an application, or an environment. By focusing on service availability, operators can easily separate the “signal from the noise” and only be alerted for service-impacting issues.