This is Part Two of our two-part blog post on the challenges of monitoring containerized applications. In Part One we discussed the basic requirements, considerations, and components involved in a production-ready Docker setup. We touched upon the ephemeral nature of containers, the importance of collecting application and container metrics, and the ever-increasing cost of configuration management.
In this post, we’ll detail the seemingly impossible technical challenge of accurately and efficiently monitoring a dynamic infrastructure that is constantly shifting. Then we’ll describe the radical approach OpsClarity employed to turn a spectacularly complex problem into an elegantly simple tool.
At OpsClarity, we have taken an application-centric (or service-centric) approach to monitoring. Services are treated as first-class citizens from which you can navigate into the containers that make up the service, and finally to the hosts that run the containers. You can see a sample service-oriented topology below. Every node represents a logical service (with multiple containers powering the service), while the colors red, orange, and green represent the health of the service. The lines between services show how services are connected to each other.
Why do we need a topology at all? What purpose does it serve in a containerized world? Well, a clear topology visualization of your container infrastructure is much more than a pretty representation of your application:
- Containers add a layer of complexity and navigation to your applications. With legacy systems, it was sufficient to know which host or group of hosts might be running a specific type of service. With containers, however, services run on multiple hosts, and the containers can be moved around dynamically by your orchestration tool, such as Kubernetes, Mesos, or Docker Swarm. Your modern infrastructure is far more dynamic than a group of well-known hosts. At any given point, you need an easy way to get a top-down view of your application and to drill into the three logical layers of your deployment (service, container, and host), whether it runs in your private datacenter or in a public cloud.
- Containers talk to each other. If this information were surfaced directly, you would see thousands of connections, from every container to every other container. Instead, it is important to group and cluster containers together and provide a higher-level view of how one group of containers talks to another. This makes it easier to know exactly how your deployment architecture is set up, which itself changes all the time as services are added and removed.
- Troubleshooting during an outage can take hours. You are probably wading through (or drowning in) hundreds of metrics from your respective services, hosts and now, containers, trying to mentally correlate them to get to the root cause of an issue. In a containerized world, you should be able to quickly narrow down a service that has issues and be able to get to misbehaving containers in a few minutes. Navigating through the application topology provides you with a top-down view where you can isolate the “misbehaving” application and services first, and then drill down to the specific service instance and containers/host. This significantly reduces the time and energy spent on root-cause analysis.
To understand how OpsClarity detects your applications and automatically builds out a complete topology, let’s look at a very simple example. Consider three containers, each running on separate hosts, that run a Java application called “Connector.” The application connects to three containerized MongoDB instances to pull data. The service level topology of this simple setup is represented as follows:
Service-oriented topology. Note that there are multiple containers powering each of these services.
Let us understand how we built this topology. To simplify, take a look at a single host running one container for Mongo and one container with the Connector service.
We see two containers running on this host. “mongodb” is the container running the MongoDB service, which listens on port 27017 inside the container. It has a mapped port on the host at 32768. “mongo_connector” is a simple application that connects to the mongodb container and fetches some data. It does not have any listening ports that are mapped.
OpsClarity runs a series of “ps” and “netstat” commands, or their equivalents, across the hosts. It uses the output of the “ps” command to build a list of all processes running on each host and inside each container across the entire infrastructure. It then groups processes by their signatures: the processes from multiple containers and hosts are automatically clustered into services in real time using our clustering algorithms. OpsClarity ignores system services such as init and cron, which add no real value to a logical service view. From this, OpsClarity builds an internal data model that precisely classifies services, the containers running each service, and the hosts they reside on. This enables OpsClarity to map out the services, but without any connections yet.
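The grouping step described above can be pictured with a small sketch. This is not OpsClarity's clustering algorithm, which the post does not detail; it assumes a very coarse process signature (the executable's base name) and a hypothetical exclusion list, and the sample records are made up for illustration.

```python
from collections import defaultdict

# Hypothetical list of system processes to ignore; the real list is not given in the post.
SYSTEM_PROCESSES = {"init", "cron", "sshd"}

def process_signature(command_line):
    """Reduce a command line to a coarse signature: the executable's base name."""
    parts = command_line.split()
    if not parts:
        return ""
    return parts[0].rsplit("/", 1)[-1]

def cluster_services(processes):
    """Group (host, container, command_line) records into services keyed by signature."""
    services = defaultdict(set)
    for host, container, command_line in processes:
        signature = process_signature(command_line)
        if not signature or signature in SYSTEM_PROCESSES:
            continue  # skip empty entries and system services
        services[signature].add((host, container))
    return dict(services)

# Illustrative records, as they might be assembled from `ps` output on each host.
processes = [
    ("host-a", "mongodb", "/usr/bin/mongod --bind_ip_all"),
    ("host-b", "mongodb", "/usr/bin/mongod --bind_ip_all"),
    ("host-a", "mongo_connector", "java -jar connector.jar"),
    ("host-a", None, "cron -f"),
]
services = cluster_services(processes)
```

A production signature would need more than the executable name (for example, the JAR or main class for Java processes), since every JVM would otherwise collapse into a single "java" service.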
Detecting Connections Between Services
Next, let’s see how OpsClarity maps out the connections across services.
It starts with obtaining the IP Address of each container. Executing a “docker inspect” on the containers, we get the following IP Addresses:
The 172.xx addresses are created by the Docker networking module on the virtual bridge network. These networks are not directly accessible to services outside the container, unless one sets up an overlay network. The mapped port on the host, 32768, is a Docker proxy that forwards traffic on 32768 on the host into the container’s network, thereby making Mongo available to the outside world.
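As a rough illustration of this lookup, the sketch below pulls the bridge IP and the host port mapping out of `docker inspect` JSON. The sample document is trimmed and its IP address is invented for the example; only the `NetworkSettings.Networks` and `NetworkSettings.Ports` fields are assumed, and the helper name is hypothetical.

```python
import json

# Trimmed, illustrative `docker inspect` output; the IP address is made up.
SAMPLE_INSPECT = """
[{"Name": "/mongodb",
  "NetworkSettings": {
    "Ports": {"27017/tcp": [{"HostIp": "0.0.0.0", "HostPort": "32768"}]},
    "Networks": {"bridge": {"IPAddress": "172.17.0.2"}}
  }}]
"""

def container_addresses(inspect_json):
    """Return {container name: {"ips": [...], "ports": {container port: host port}}}."""
    result = {}
    for entry in json.loads(inspect_json):
        settings = entry.get("NetworkSettings", {})
        ips = [
            net["IPAddress"]
            for net in settings.get("Networks", {}).values()
            if net.get("IPAddress")
        ]
        ports = {}
        for container_port, bindings in (settings.get("Ports") or {}).items():
            # Exposed but unmapped ports appear with a null binding list.
            if bindings:
                ports[container_port] = bindings[0]["HostPort"]
        result[entry["Name"].lstrip("/")] = {"ips": ips, "ports": ports}
    return result

addresses = container_addresses(SAMPLE_INSPECT)
```

In practice the JSON would come from running `docker inspect` against each container rather than from a string literal.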
Once OpsClarity obtains the known IP Address of the containers, it looks at the output from netstat inside the mongo_connector container:
From the above set of data, it is clear that the output from netstat shows a connection to mongodb. Every connection also has a PID that tells you the program that is actually involved in establishing the connection. OpsClarity runs its clustering and correlation algorithm on this data that does two things:
- Uses the process and network information to chart out a raw map of every process and how it is connected with other processes.
- Overlays clustered service information on top of PIDs. It aggregates the connections from a PID-to-PID view into a service-to-service view, while simultaneously clustering connections and weeding out ephemeral ones.
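The two steps above can be sketched as follows. This is a simplified stand-in, not OpsClarity's correlation algorithm: the record shapes, the function name, and the `min_count` rule for discarding ephemeral connections are all assumptions made for illustration.

```python
from collections import Counter

def service_edges(connections, pid_to_service, listener_to_service, min_count=2):
    """Collapse per-process connections into service-to-service edges.

    connections: iterable of (host, pid, remote_ip, remote_port) tuples.
    Edges seen fewer than min_count times are treated as ephemeral and dropped.
    """
    counts = Counter()
    for host, pid, remote_ip, remote_port in connections:
        source = pid_to_service.get((host, pid))
        target = listener_to_service.get((remote_ip, remote_port))
        if source and target and source != target:
            counts[(source, target)] += 1
    return {edge for edge, seen in counts.items() if seen >= min_count}

# Illustrative inputs: which service owns each PID and each listening address.
pid_to_service = {
    ("host-a", 41): "connector",
    ("host-b", 57): "connector",
    ("host-a", 99): "debug-shell",
}
listener_to_service = {
    ("172.17.0.2", 27017): "mongodb",
    ("172.17.0.3", 27017): "mongodb",
}
connections = [
    ("host-a", 41, "172.17.0.2", 27017),
    ("host-b", 57, "172.17.0.3", 27017),
    ("host-a", 99, "172.17.0.2", 27017),  # one-off connection, dropped as ephemeral
    ("host-a", 41, "10.0.0.9", 443),      # unknown listener, ignored
]
edges = service_edges(connections, pid_to_service, listener_to_service)
```

Two connector processes on different hosts collapse into a single connector-to-mongodb edge, which is the service-level view the topology displays.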
Automating Configuration Management for Container Monitoring
As discussed in Part One of this series, configuration management for your monitoring tool in a containerized world can be challenging. Containers can move between hosts, and new containers are constantly being spun up. Performing availability checks and metric collection for a newly created container can be a futile exercise if you rely on static configuration files to manage your monitoring configuration. For example, if a containerized Elasticsearch cluster scales automatically, how do you figure out that you need to push Elasticsearch configuration files into the new container?
At OpsClarity, we handle configuration management and monitoring as being closely interdependent. That means you don’t have to fiddle around with Chef or Puppet to maintain your configurations correctly. The cost of configuration management grows too high if it is dependent on laborious, time-consuming, manual setup. For that reason, our mantra at OpsClarity is to always auto-configure and auto-collect metrics, thereby eliminating excessive overhead from configuration management. Take a minute and think about how many configuration files you change each day. Time is money, and manual configuration wastes a boatload of both.
So how does OpsClarity provide auto-configuration for your dynamic infrastructure? As we mentioned before, our topology discovery already discovers and clusters containers running the same service.
A configuration that is applied to a service automatically trickles down to all the instances of that service. OpsClarity also automatically detects changes in a service and reconfigures as needed. For example, if the number of Elasticsearch containers increases due to a higher system load, the configuration relevant to the Elasticsearch cluster is automatically applied to the newly spawned containers. If ports change, OpsClarity dynamically reconfigures to pick up the new ports. Configuration management becomes a breeze even with a dynamic infrastructure like Docker.
Following Containers That Move Around
Most monitoring plugins that collect metrics from different types of services need an IP address and a port where metric data is made available. However, when containers are orchestrated by Kubernetes, Mesos, or Docker Swarm, they are moved around dynamically to optimize resource usage. Containers can be relocated, new ones created, and existing ones destroyed. From our discussion so far, we know that new containers get their monitoring configs pushed to them through OpsClarity. However, you can see the next problem that pops up. Docker assigns IP addresses to containers, such as 172.17.0.4, 172.17.0.5, and so on. If metrics are made available on an IP:Port combination, you cannot push the same IP address to all the containers.
When we push configurations at the service level down to containers, we use the IP Address of the container to dynamically substitute in the IP:Port combination. Internally, the OpsClarity agent looks at all the listening ports and builds a map of the IP address for the listening port on each container. When a service inside a container needs to be monitored, we automatically figure out the IP Address for that listening port and pass that configuration to the plugin that collects data. In the above case where configuration had to be passed to Elasticsearch for metric data collection, we detect the IP Address of each container and push the config for data collection to that specific container. OpsClarity thus has the following configurations automatically generated:
| Container | Container IP | Configuration generated |
|---|---|---|
| container 1 | 172.17.0.5 | Port: 9200 |
| container 2 | 172.17.0.6 | Port: 9200 |
| container 3 | 172.17.0.7 | Port: 9200 |
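One way to picture this substitution is a service-level template that is expanded once per container. The sketch below is only an assumption about the shape of such a config: the key names (`plugin`, `port`, `host`) and the function name are hypothetical, while the IP addresses and port come from the table above.

```python
def expand_service_config(template, container_ips):
    """Produce one plugin config per container by filling in that container's IP."""
    return {
        name: {**template, "host": ip}
        for name, ip in container_ips.items()
    }

# Service-level settings defined once; the key names are illustrative.
elasticsearch_template = {"plugin": "elasticsearch", "port": 9200}

# Container IPs as discovered by the agent (values taken from the table above).
container_ips = {
    "container 1": "172.17.0.5",
    "container 2": "172.17.0.6",
    "container 3": "172.17.0.7",
}
configs = expand_service_config(elasticsearch_template, container_ips)
```

When a fourth container appears, only `container_ips` changes; the service-level template stays untouched, which is the point of configuring at the service level.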
Auto-Scaling and False Alerts
In a containerized world, things are ephemeral by design. As load increases, the number of containers can be set to increase automatically to handle it. Similarly, containers can be destroyed when load drops. The idea is powerful and allows for the most efficient use of resources. But with all that efficiency comes a difficult monitoring problem. Traditionally, most monitoring tools do availability checks to determine whether a container is up or down. These tools rely on statically configured thresholds: a minimum and a maximum number of containers. If the number of containers goes below or above the threshold, you get alerted. As you can see, this is both imprecise and inflexible. How do you estimate the thresholds? Is it just a number that you pull out of thin air?
For this reason, OpsClarity does not use static thresholds. Instead, we automatically correlate the number of containers with relevant metrics to see if they have the same pattern. We do this through anomaly detection models. Let’s take Kafka, for example. If the incoming message rate in Kafka drops, and the number of containers handling Kafka messages increases, it is an anomaly. You’d ideally want the container count to increase only when the number of incoming messages also increases. This helps you get away from static thresholds and lets the detectors handle the dynamic thresholds.
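To make the Kafka example concrete, here is a deliberately simple stand-in for such a detector. It is not the anomaly detection model OpsClarity uses, which the post does not describe; it only compares the direction of change of two series over a window, and the function name and `tolerance` parameter are assumptions.

```python
def count_moves_against_load(container_counts, message_rates, tolerance=0.05):
    """Flag when container count rises while the load metric falls over a window.

    Compares the first and last samples of two equally long series; relative
    changes smaller than `tolerance` are treated as flat.
    """
    if len(container_counts) != len(message_rates) or len(container_counts) < 2:
        raise ValueError("need two series of equal length with at least two samples")

    def relative_change(series):
        first, last = series[0], series[-1]
        return (last - first) / first if first else 0.0

    count_change = relative_change(container_counts)
    rate_change = relative_change(message_rates)
    return count_change > tolerance and rate_change < -tolerance

# Containers scale up while incoming Kafka messages fall: anomalous.
print(count_moves_against_load([3, 4, 6], [1000, 800, 500]))    # True
# Containers scale up along with load: expected.
print(count_moves_against_load([3, 4, 6], [1000, 1500, 2100]))  # False
```

No fixed container count appears anywhere in the check; the alert depends only on whether the two metrics move together.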
And there you have it. The problem of monitoring and managing complex, dynamic, containerized infrastructure is answered with a level of simplicity that only seamless automation can provide.
In the next post of this series, we will discuss a troubleshooting use case and demonstrate how to quickly isolate a root cause in a dynamic application that leverages several containers.