So now we’ve done some of the basics, and it’s time to take the training wheels off. Let’s take a look at some more complex, real-world metrics you should pay attention to. We’ll show you the metrics, talk about why they’re important and what they might mean. For this section we’ve visualized the data using Sysdig Cloud, the commercial version of Sysdig that’s designed to aggregate data across many hosts and display within a web UI. You could do the following examples via any of the open-source time-series databases, provided you’re collecting the correct information.
Visualizing CPU Shares & Quota
For those of you used to monitoring in a VM-based world, you’re likely familiar with the concepts of CPU allocation, stolen CPU, and greedy VMs. Those same issues apply with containers, except they are magnified significantly. Because you may be packing containers densely on a machine, and because workloads are typically much more dynamic than in VM-based environments, you may encounter significantly more resource conflict if you’re not carefully monitoring and managing allocation. Let’s focus on CPU, as it’s a bit more complex than memory.
Let’s start by visualizing CPU shares. Imagine a host with 1 core and 3 containers using as much CPU as possible. We assign 1024 shares to one container and 512 shares to the other two. This is what we get:
First is using twice as much host CPU as the others because it has twice as many shares. Each container is using 100% of the CPU shares assigned to it. But what happens if Third does not need any CPU at all?
Unused shares are redistributed to the other containers in proportion to their weight. So if Third is not using any of its CPU shares, First and Second are instead each using 140% of their assigned CPU shares. In general, it’s OK to consume more shares than originally allocated, because the kernel tries not to waste CPU.
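The redistribution behavior can be modeled with a small sketch. This is an illustrative model, not kernel code: the function and container names are made up for this example, and it simply apportions one core among whichever containers are busy, weighted by their shares.

```python
def effective_cpu(shares, active):
    """Fraction of one core each container gets when every container
    in `active` wants as much CPU as possible. Shares belonging to
    idle containers are redistributed to the busy ones by weight."""
    total = sum(shares[name] for name in active)
    return {name: shares[name] / total if name in active else 0.0
            for name in shares}

shares = {"first": 1024, "second": 512, "third": 512}

# All three busy: first gets twice the CPU of the others.
print(effective_cpu(shares, {"first", "second", "third"}))
# {'first': 0.5, 'second': 0.25, 'third': 0.25}

# Third idle: its slice of the core goes to the others by weight.
print(effective_cpu(shares, {"first", "second"}))
```

With Third idle, First climbs from 50% of the core to two-thirds, and Second from 25% to one-third, which is why the shares-used percentage of each rises above 100.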
A percentage of shares used that’s consistently over 100 means we are not allocating enough resources to our services. The implication in the example above is that First and Second were able to consume much more CPU than they were originally allocated. If either of those were, for example, a web server, it likely means we are allocating less CPU than it needs to complete current user requests (that’s not a good situation). If either were a batch processing job, it means the job can use more CPU to finish faster (good, but maybe not critical).
Visualizing CPU Quota
Giving processes the maximum available CPU may not always be what you want. If your cluster is multi-tenant, or if you just need a safe ceiling for an unpredictable application, you may want to enforce a hard limit on CPU utilization. The Linux kernel supports absolute CPU limits with CPU quotas. You assign a quota in milliseconds relative to a period, and the process can run on the CPU for only that fraction of each period.
For example, let’s consider the same case as above, now with a quota of 50ms/100ms for First and 25ms/100ms for Second and Third:
The result is the same as with shares. The difference occurs when Third does not use the CPU allocated to it.
Now instead of giving CPU to other containers, the kernel is enforcing the absolute quota given. The total CPU usage we will see reported for the host will be 75%.
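The key difference from shares is that a quota is a hard cap: spare capacity is not handed out. The sketch below is an illustrative model (the function name and container names are made up) that caps each container at quota/period and reproduces the 75% host total from the example.

```python
def cpu_usage(quotas_ms, period_ms, demand):
    """Per-container CPU usage (as a fraction of one core) under hard
    CFS quotas: usage is capped at quota/period, and unused quota is
    NOT redistributed to other containers."""
    return {name: min(demand.get(name, 0.0), quota / period_ms)
            for name, quota in quotas_ms.items()}

quotas = {"first": 50, "second": 25, "third": 25}   # ms per 100 ms period
busy = {"first": 1.0, "second": 1.0}                # third is idle

usage = cpu_usage(quotas, 100, busy)
print(usage)                # first capped at 0.5, second at 0.25, third at 0.0
print(sum(usage.values()))  # 0.75 -- the 75% host total reported above
```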
Basic Networking Data
Regardless of your platform, some things don’t change… and that’s certainly true when it comes to networking data. Especially with Docker in the mix, networking can become more complex and communication patterns can become more convoluted. It’s important to keep track of basic information: how much data is a container consuming? How much is it emitting?
This type of data collection requires something more full-featured than the Docker API, so instead you could collect this type of information from open-source sysdig. Let’s look at some basic network data for a set of three containers each running the same Java application:
As you can see, there is some slight variation among these three containers. If, however, we saw an extreme variation, we might want to investigate further.
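One simple way to turn “extreme variation” into an alertable signal is to flag any container whose traffic strays too far from the group. This is a hedged sketch: the function, the tolerance, and the byte counts are all hypothetical stand-ins for whatever your collector reports.

```python
from statistics import median

def outliers(bytes_per_container, tolerance=2.0):
    """Flag containers whose traffic deviates from the group median
    by more than `tolerance` times, in either direction."""
    mid = median(bytes_per_container.values())
    return [name for name, b in bytes_per_container.items()
            if b > mid * tolerance or b < mid / tolerance]

# Hypothetical network bytes for three containers of the same service:
net = {"java-1": 410_000, "java-2": 380_000, "java-3": 440_000}
print(outliers(net))       # [] -- only slight variation, nothing to do

net["java-3"] = 4_400_000  # one container suddenly ~10x busier
print(outliers(net))       # ['java-3'] -- worth investigating
```

Using the median rather than the mean keeps a single runaway container from dragging the baseline with it.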
At the same time, since these containers are all running the same Java application, it may be more useful to consider them a “service” and see how they are performing in aggregate. This leads us to our last example.
From Container to Microservice Data With Labels
Docker provides a concept called “labels.” These are much like they sound: additional, contextual information applied on a per-container basis. They are unstructured and non-hierarchical. As such, you can use them to broadly identify subcategories of your containers. All the containers of a given service could carry the same label, non-standard containers could carry another label, and different versions of software could have yet another label. If you’re a filer and an organizer, labels will be heaven for you.
So what can we do with a label? Well, the first thing is that we can aggregate data. From the example above, let’s suppose we applied the label “javaapp” to those three containers. Now, when we show our network data we see something much simpler:
One line—that’s it. In this case we’re showing the average network data across all three containers, but you could easily calculate anything that helps you better understand the performance of this collection of containers.
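The aggregation itself is straightforward: group samples by label and average them. The sketch below is illustrative; the function name and the sample values are hypothetical, and the “javaapp” label matches the example in the text.

```python
from collections import defaultdict

def aggregate_by_label(samples):
    """Average a per-container metric across all containers sharing a
    label, turning container-level data into service-level data."""
    sums = defaultdict(lambda: [0.0, 0])
    for labels, value in samples:
        for label in labels:
            sums[label][0] += value
            sums[label][1] += 1
    return {label: total / n for label, (total, n) in sums.items()}

# Hypothetical network samples: (labels on the container, bytes/s)
samples = [({"javaapp"}, 410_000),
           ({"javaapp"}, 380_000),
           ({"javaapp"}, 440_000)]
print(aggregate_by_label(samples))   # {'javaapp': 410000.0}
```

Swapping the average for a sum, max, or percentile gives you whichever service-level view is most useful.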
But let’s go a little further with labels, network data, and the “top connections” example we showed in the open-source section.
Using this information and an appropriate visualization, we can do more than create a table of network data: we can actually create a map of our services, the containers that make them up, and who they are communicating with. Here we can see the aggregated java service, the individual containers that make up the service, and, in a more complete view, all the other services in your environment that the java service communicates with. Note that this is a little more advanced than the other examples, and in particular the visualization may require some coding in D3 or something similar if you want to stay fully open source.
Here we see a few different things: our “javaapp” consists of three containers (blue) and a service called “javaapp” (grey), which is just an abstraction created by whoever is routing requests to those containers. We see each of those containers communicating with a Mongo service and a Redis service, and presumably those are made up of containers as well (hidden here to avoid too much complexity).
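Building the data behind such a map amounts to collapsing container-to-container connections into service-to-service edges using labels. This is a hedged sketch: the function, container names, and byte counts are hypothetical, and a real map would feed these edges into D3 or a similar visualization layer.

```python
from collections import defaultdict

def service_map(connections, label_of):
    """Collapse observed container-to-container connections into
    service-to-service edges, keyed by each container's label."""
    edges = defaultdict(int)
    for src, dst, byte_count in connections:
        edges[(label_of[src], label_of[dst])] += byte_count
    return dict(edges)

label_of = {"java-1": "javaapp", "java-2": "javaapp", "java-3": "javaapp",
            "mongo-1": "mongo", "redis-1": "redis"}

# Hypothetical observed connections: (src container, dst container, bytes)
connections = [("java-1", "mongo-1", 120_000),
               ("java-2", "mongo-1", 90_000),
               ("java-3", "redis-1", 40_000)]

print(service_map(connections, label_of))
# {('javaapp', 'mongo'): 210000, ('javaapp', 'redis'): 40000}
```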
This view helps us in a few different ways:
- We can quickly understand the logical composition of our application.
- We can aggregate containers into higher-level services.
- We can easily see communication patterns among containers.
- We may be able to easily spot outliers or anomalies.