5 Things We've Learned About Monitoring Containers
As containers mature, it's key to look back and learn lessons from the past about mapping your data and monitoring your containerized apps.
This article will cover how to build a scaled-out, highly reliable monitoring system that works across tens of thousands of containers. It’s based on Sysdig’s experience building its container monitoring software, but the same design decisions will impact you if you decide to build your own tooling in-house. I’ll share a bit about what our infrastructure looks like, the design choices, and the tradeoffs we’ve made. The five areas I’ll cover are:
Instrumenting the system
Mapping your data to your applications, hosts, and containers
Deciding what data to store
Enabling troubleshooting in containerized environments
OK, let’s get into the details, starting with the impact containers have had on monitoring systems.
Why Do Containers Change the Rules of the Monitoring Game?
Containers are pretty powerful. They are:
Simple: mostly a single process
Small: 1/10th of a VM
Isolated: fewer dependencies
Dynamic: can be scaled, killed, and moved quickly
Containers are simple and great building blocks for microservices, but their simplicity comes at a cost. The ephemeral nature of containers adds to their monitoring complexity. Just knowing that some containers exist is not enough: deep container visibility is critical for ops teams to monitor containers and troubleshoot issues. Let’s start breaking down these monitoring challenges.
Instrumentation Needs to Be Transparent
In static or virtual environments, an agent is usually run on a host and configured for specific applications. However, this approach doesn’t work for containerized environments:
You can’t place an agent within each container.
Dynamic applications make it challenging to manually configure agent plug-ins to collect metrics.
In containerized environments, you need to make instrumentation as transparent as possible with very limited human interaction. Infrastructure metrics, application metrics, service response times, custom metrics, and resource/network utilization data should be ingested without spinning up additional containers or making any effort from within the container. There are two possible approaches.
The first relies on pods, a concept introduced by Kubernetes. Containers within a pod can see what the other containers in that pod are doing, so a monitoring agent can run alongside the application as a “sidecar” container. This is relatively easy to do in Kubernetes, but if you have many pods on a machine, running one agent per pod may consume significant resources and adds a dependency to every pod. A monitoring sidecar with performance, stability, or security issues can then wreak havoc on your application.
The second model is per-host, transparent instrumentation. This captures all application, container, statsd, and host metrics at a single instrumentation point and sends them to one container per host for processing and transfer. This eliminates the need to convert these metrics into statsd. Unlike sidecar models, per-host agents drastically reduce the resource consumption of monitoring agents and require no application code modification. In Sysdig’s case, we created a non-blocking kernel module to achieve this. That, however, required a privileged container.
By introducing “ContainerVision,” Sysdig chose the latter, and herein lies the biggest tradeoff we had to make. Although running the monitoring agent as a kernel module raises concerns and implementation complexities, it allows us to collect more data with lower overhead, even in high-density container environments, and reduces threats to the environment. Finally, to address these concerns as a third-party software provider, we open sourced our kernel module as part of the Sysdig Linux and container visibility command-line tool. This last point isn’t something you’re likely to deal with if you’re building your own internal tooling.
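To make the resource argument concrete, here is a rough back-of-the-envelope sketch. The host count, pod density, and per-agent memory footprint are made-up numbers, and it assumes (purely for illustration) that a sidecar agent and a per-host agent use the same amount of memory.

```python
def agent_overhead_mb(hosts: int, pods_per_host: int, agent_mb: int) -> dict:
    """Compare total agent memory for sidecar vs. per-host deployment.

    Assumes every agent instance has the same footprint (agent_mb),
    which is a simplification for illustration only.
    """
    sidecar_total = hosts * pods_per_host * agent_mb  # one agent per pod
    per_host_total = hosts * agent_mb                 # one agent per host
    return {"sidecar_mb": sidecar_total, "per_host_mb": per_host_total}


# Hypothetical cluster: 100 hosts, 30 pods each, 50 MB per agent.
result = agent_overhead_mb(hosts=100, pods_per_host=30, agent_mb=50)
print(result)
```

Under these assumptions the sidecar model's overhead grows with pod density, while the per-host model's overhead grows only with the number of hosts.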
How to Map Your Data to Your Applications, Hosts, Containers, and Orchestrators
As your environment increases in complexity, the ability to filter, segment, and group metrics based on metadata is essential. Tags allow you to represent the logical blueprint of your application architecture in addition to the physical reality of where containers are running.
There are two types of tags: explicit tags (attributes you choose to store) and implicit tags (metadata from the orchestrator). Explicit tags can be added by your team based on best practices, but implicit tags should be captured by default. The latter is a key element for orchestrators. Each unique combination of tags is a separate metric that you need to store, process, and then recall on demand for your user. We’ll discuss the major implications of this in the “Deciding what data to store” section below.
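The point that each unique tag combination is a separate metric can be illustrated with a small sketch. The metric name, tag keys, and container names below are hypothetical. A series is identified by its name plus its sorted tag pairs, so replacing one container with another creates a new series even though the service is unchanged.

```python
def series_key(name: str, tags: dict) -> tuple:
    """Identify a time series by metric name plus its full tag set."""
    return (name, tuple(sorted(tags.items())))


# Hypothetical samples: web-1 is later replaced by web-3.
samples = [
    ("cpu.used", {"service": "web", "container": "web-1"}),
    ("cpu.used", {"service": "web", "container": "web-2"}),
    ("cpu.used", {"container": "web-1", "service": "web"}),  # same series, different key order
    ("cpu.used", {"service": "web", "container": "web-3"}),
]

unique_series = {series_key(name, tags) for name, tags in samples}
print(len(unique_series))
```

Four samples collapse into three stored series here. In a real environment with short-lived containers, the number of series can grow much faster than the number of services.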
Orchestrators radically changed the scheduling management approach for containers and impacted users’ monitoring strategy. Individual containers became less important, while the performance of a service became more important. A service is made up of several containers, and the orchestrator can move containers as needed to meet performance and health requirements.
There are two implications for a monitoring system:
Your monitoring system must implicitly tag all metrics according to the orchestration metadata. This applies to systems, containers, application components, and even custom metrics. Your developers should only have to output the custom metric, and the monitoring system should keep the state of each metric.
Your monitoring agent should auto-discover any application and collect the relevant metrics. This may require you to update your monitoring system to provide these functionalities.
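As a minimal sketch of implicit tagging, the snippet below enriches raw metrics with orchestrator metadata and then rolls them up per service. The metadata table, field names, and the rule that explicit tags override implicit ones are all assumptions made for illustration. A real system would pull this metadata from the orchestrator's API.

```python
from collections import defaultdict

# Hypothetical orchestrator metadata, keyed by container ID.
ORCHESTRATOR_METADATA = {
    "c1": {"service": "checkout", "namespace": "prod"},
    "c2": {"service": "checkout", "namespace": "prod"},
    "c3": {"service": "search", "namespace": "prod"},
}


def tag_metric(metric: dict) -> dict:
    """Attach implicit (orchestrator) tags to a metric.

    Assumption: explicit tags set by the team win on conflict.
    """
    implicit = ORCHESTRATOR_METADATA.get(metric["container_id"], {})
    return {**metric, "tags": {**implicit, **metric.get("tags", {})}}


def sum_by_service(metrics: list) -> dict:
    """Aggregate metric values per service rather than per container."""
    totals = defaultdict(float)
    for m in map(tag_metric, metrics):
        totals[m["tags"].get("service", "unknown")] += m["value"]
    return dict(totals)


raw = [
    {"name": "requests", "container_id": "c1", "value": 120.0},
    {"name": "requests", "container_id": "c2", "value": 80.0},
    {"name": "requests", "container_id": "c3", "value": 40.0, "tags": {"team": "discovery"}},
]
service_totals = sum_by_service(raw)
print(service_totals)
```

The developer emitting the metric never mentions the service. The tag comes from the orchestrator, which is what lets the view survive containers being moved or replaced.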
There are two ways to achieve this: relying on the events emitted by the orchestrator to flag containers, or determining which applications are running based on heuristics applied to each container. Sysdig chose the latter approach: it requires more intelligence in your monitoring system, but produces more reliable results.
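To give a feel for the heuristic approach, here is a deliberately simple sketch that guesses which applications run in a container from its process names. The signature table is hypothetical and this is not Sysdig's actual detection logic. A production system would likely combine several signals, such as listening ports and command-line arguments.

```python
# Hypothetical signatures mapping a process name to an application.
SIGNATURES = {
    "redis-server": "redis",
    "mysqld": "mysql",
    "nginx": "nginx",
    "java": "jvm",
}


def detect_apps(process_names: list) -> set:
    """Return the set of known applications found among the processes."""
    return {SIGNATURES[p] for p in process_names if p in SIGNATURES}


detected = detect_apps(["sh", "nginx", "redis-server"])
print(sorted(detected))
```

Once an application is recognized, the agent can enable the matching metric collection without anyone configuring a plug-in by hand.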
Deciding What Data to Store: "All the Data"
Distributed systems increase monitoring data and the resulting metrics. Although it is appealing to reduce your metric count for cost and simplicity, you’ll find that the more complex your infrastructure becomes, the more important it is to have all the data for ad-hoc analysis and troubleshooting. For example, how would you identify an intermittent slow response time in a Node app with missing metrics data? How can you figure out if it’s a systemic problem in the code, a container on the fritz, or an issue with AWS? Aggregating all that information across microservices will not give you enough visibility to solve the problem.
This means we will collect a lot of metrics and events data. In order to have this data persisted and accessible to our users, we decided to:
Build a horizontally scalable backend with the ability for our application to isolate data, dashboards, alerts, etc. based on a user or service.
Store full-resolution data for up to six hours and aggregate thereafter. Our backend consists of horizontally scalable clusters of Cassandra (metrics), Elasticsearch (events), and Redis (intraservice brokering). This provides high reliability and scalability to store data for long-term trending and analysis. All the data is accessible through a REST API. You will likely end up building the same components if you create your own system.
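The "full resolution for six hours, aggregate thereafter" policy can be sketched as follows. The six-hour window comes from the text above. The one-hour rollup bucket and the use of a simple mean are assumptions, and a real backend would apply this inside the datastore rather than on in-memory lists.

```python
from collections import defaultdict
from statistics import mean

FULL_RES_WINDOW_S = 6 * 3600  # keep raw points for six hours
ROLLUP_BUCKET_S = 3600        # assumed rollup granularity: one hour


def compact(points: list, now: int) -> tuple:
    """Split (timestamp, value) points into raw recent data and hourly averages."""
    cutoff = now - FULL_RES_WINDOW_S
    recent = [(ts, v) for ts, v in points if ts >= cutoff]

    buckets = defaultdict(list)
    for ts, v in points:
        if ts < cutoff:
            # Align each old point to the start of its rollup bucket.
            buckets[ts - ts % ROLLUP_BUCKET_S].append(v)

    rolled_up = [(start, mean(vals)) for start, vals in sorted(buckets.items())]
    return recent, rolled_up


points = [(0, 1.0), (1800, 3.0), (3600, 5.0), (20000, 7.0), (35000, 9.0)]
recent, rolled_up = compact(points, now=10 * 3600)
print(recent)
print(rolled_up)
```

Averaging loses spikes, so you may also want to keep per-bucket minimum and maximum values. That choice trades storage for troubleshooting detail.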
How to Enable Troubleshooting in Containerized Environments
Containers are ideal for deployment and repeatability, but troubleshooting them is challenging. Troubleshooting tools such as ssh, top, ps, and ifconfig are neither accessible in PaaS-controlled environments nor available inside containers, and the ephemeral nature of containers adds to this complexity. This is where container troubleshooting tools come into play: the ability to capture every single system call on a host gives deep visibility into how any application, container, host, or network performs.
Interfacing with the orchestration master provides relevant metadata that is not limited to the state of a machine, but also captures the state and context of the distributed system. All this data is captured in a file, allowing you to troubleshoot production issues on your laptop and run post-mortem analyses with ease.
For example, when an alert is triggered by a spike in network connections on a particular container, all system calls on the host are recorded. Troubleshooting this alert via csysdig provides all the relevant context and helps identify the root cause by drilling down to the network connections.
Building a highly scalable, distributed monitoring system is not an easy task. Whether you choose to do it yourself or leverage someone else’s system, I believe you’re going to have to make many of the same choices we made.