Interview: Kirk Pepperdine, Performance, and the Dominating Consumer
Kirk, the "dominating consumer" is one of your key terms. What is it and why is it so central in your performance tuning sessions?
The "dominating consumer" is a term I coined to help describe a way to identify the fundamental performance problems that an application is facing. Generally, what I find is that no matter how many performance problems are buried in an application, there is one problem that will express itself more prominently than any of the others. Also, performance problems will all express themselves in the hardware, in some manner. So, the purpose of the term "dominating consumer" is to help one identify the underlying nature of the performance problem. My definition for "dominating consumer" is "that activity that is controlling how the CPU is being utilized".
When you look at a system this way, especially a Java system, what we normally find is that performance is bound by execution, by space, by I/O (either network or disk I/O), or by locks. All of these conditions show up as different patterns in how the hardware is being utilized, and each one expresses itself differently.
So the purpose of the term "dominating consumer" is to help people understand what the relationships are and then to help them use that information to isolate the problematic code in their application. It turns out that there are only three candidates for dominating consumer: the operating system, the JVM, or the application, though there's actually one other case, which is when there isn't a dominating consumer at all.
- When the operating system dominates, the problem is rooted in some bad operating system interaction. For example, heavy load on a disk system will cause the operating system to do extra work and that's reflected in the operating system load on the CPU.
- When the JVM is the dominating consumer, we have to look at object lifecycles, that is, at the creation of objects and garbage collection.
- When the application is the dominating consumer, it means that there's some algorithmic issue, in which case we'd use traditional application profiling to identify the problem.
So, as you can see from this description, what we're doing is we're setting the table for the next round of investigation. We're really deciding what profiler we should use in order to be most effective in profiling the problem. And what we're looking for in the profiling tool is a direct vector into the code that's responsible for the problem.
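As a rough illustration of "setting the table", the mapping from dominating consumer to next investigative step can be written down as a small Java enum. This is only a sketch: the type name `DominatingConsumer`, the method `nextStep`, and the wording of each step are hypothetical. They paraphrase the descriptions in this interview, including the "no dominating consumer" case that is discussed further on, and are not taken from any tool.

```java
/**
 * The candidates for dominating consumer named in the interview, each paired
 * with the kind of investigation it points to. All names here are
 * hypothetical and exist only for illustration.
 */
enum DominatingConsumer {
    OPERATING_SYSTEM("Examine operating system interactions, e.g. heavy disk or network I/O"),
    JVM("Profile memory: object creation, object lifecycles, garbage collection"),
    APPLICATION("Profile execution to find the algorithmic issue"),
    NONE("Ask what is keeping the threads out of the CPU");

    private final String nextStep;

    DominatingConsumer(String nextStep) {
        this.nextStep = nextStep;
    }

    /** Returns a short description of the next round of investigation. */
    String nextStep() {
        return nextStep;
    }

    public static void main(String[] args) {
        // Print the whole table, one line per candidate.
        for (DominatingConsumer c : values()) {
            System.out.println(c + " -> " + c.nextStep());
        }
    }
}
```

The point of the table is the one made above: the classification does not fix anything by itself, it only decides which kind of profiling is worth doing next.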
Is it always a code problem, then?
Well, sometimes it can be a JVM configuration issue or you simply don't have enough hardware. There are these kinds of issues you have to consider. But, after that, it's a rare occasion where we don't actually go in and change code in the application.
Since most problems will require you to make some code alterations, this is exactly why we want a vector into the code. I don't want to be hunting around in the code for where the problem might be. Instead, I want to have a tool that will show me exactly where the problem is. By using this concept of a dominating consumer, I know exactly what type of profiling (threads, memory, execution, or whatever) will be most effective in getting me that vector into the code.
OK, so how do you decide which is the dominating consumer?
What I look at are three key performance indicators: user utilization of the CPU, operating system utilization of the CPU, and garbage collection metrics.
What I look for are different patterns in these key performance indicators and these different patterns point to one of the possibilities as being at the root of the problem.
What are some of these patterns?
The first pattern is system time. If the system time is greater than 10% of total CPU and/or it is equal to or greater than user time, on a consistent basis, then we have the condition where the operating system is being overutilized/overworked. Consequently, what we see is that the application threads either won't be able to get into the CPU or that they're going to get very little out of the CPU. That's one pattern.
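The system-time rule of thumb can be sketched in a few lines of Java. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, each sample is a `{user, system}` pair in percent of total CPU, "and/or" is read as a plain "or", and "on a consistent basis" is read as "true for every sample taken". How the samples are collected (for example from an operating system monitoring tool) is left out.

```java
final class SystemTimeCheck {

    /**
     * Applies the rule to one sample: system time above 10% of total CPU,
     * or system time at least as large as user time.
     * Note: an idle sample (both values 0) also satisfies the second clause,
     * so idle periods should be filtered out before using this check.
     */
    static boolean systemTimeFlagged(double userPct, double systemPct) {
        return systemPct > 10.0 || systemPct >= userPct;
    }

    /**
     * Returns true if every sample is flagged. Each row is {userPct, systemPct}.
     * Reading "consistent" as "all samples" is an assumption of this sketch.
     */
    static boolean osDominates(double[][] samples) {
        if (samples.length == 0) {
            return false; // no evidence, no diagnosis
        }
        for (double[] s : samples) {
            if (!systemTimeFlagged(s[0], s[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to exercise the check.
        double[][] busyKernel = { {40.0, 15.0}, {30.0, 12.0} };
        double[][] busyUser = { {95.0, 3.0}, {96.0, 2.0} };
        System.out.println("busyKernel: " + osDominates(busyKernel));
        System.out.println("busyUser:   " + osDominates(busyUser));
    }
}
```

A stricter or looser reading of "consistent" (say, a majority of samples over a time window) would only change the loop in `osDominates`.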
For another pattern, I'll take the work I did with the Scala compiler as an example. In that investigation, I very quickly determined that the Scala compiler is, ironically, single threaded. That's quite easy to see because on my dual core machine, only one CPU can be fully utilized. In that case, I would have said that nothing dominates, i.e., that there's no dominating consumer of CPU. The debuggable question that comes from that analysis is, quite naturally: "What is keeping my threads out of the CPU?" With the Scala compiler, the answer was easy, that is, there was only one thread to be scheduled into the CPU.
Even so, I went back and analyzed the Scala compiler again, this time asking "who is the dominating consumer of the CPU?", given that only a single CPU was being utilized. In this case, system times were very low, 2 to 3 percent, and user time was running at about 95 or 96 percent. So that meant that the dominating consumer had to be in user space, i.e., the JVM or the application. A quick analysis of garbage collection metrics led to the diagnosis that the JVM was the dominating consumer. The follow-up to that was an investigation of how the application utilized memory, using a memory profiling tool, whereby I found that resizing the symbol table was the culprit. All of that analysis took about 10 to 15 minutes.
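The interview does not say which garbage collection metrics were examined. As one possible starting point, the standard `java.lang.management` API reports accumulated collection time, which can be compared against JVM uptime. The sketch below uses hypothetical class and method names and is an assumption about how such a check might look, not a description of the method used on the Scala compiler. The resulting fraction is only a rough indicator, since collectors with concurrent phases may report time that is not all pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverhead {

    /** Sums the accumulated collection time, in milliseconds, over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Returns GC time as a fraction of uptime, or 0.0 if uptime is not positive. */
    static double gcFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gc, uptime, 100.0 * gcFraction(gc, uptime));
    }
}
```

If user time is high and this fraction is also high, that points toward the JVM rather than the application as the dominating consumer. What counts as "high" is a judgment call that this sketch deliberately leaves open.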
So, basically, knowing what to look for saves a lot of time.
It definitely saves a lot of time. Staying out of the code saves a lot of time. Using the tooling to direct your efforts saves a lot of time, makes you much more effective, and keeps you focused. It helps you build a story in your mind. You build up a body of evidence, until finally you have so much evidence that the problem becomes clear, letting you vector into the code and stay focused on what the problem is.
After all, the biggest problem is that development teams focus on ugly code or code that doesn't look good. And then they get distracted from the real problem. They make changes that have no impact on the underlying problem. In one example I can cite, the developers focused on some ugly Hibernate code. The real problem, however, was loitering objects. But only tooling could expose the loitering object issue. Because the developers looked at code and not at tooling, they completely missed the problem, spending weeks on fixing their ugly Hibernate code. And, at the end of the day, the primary condition inhibiting performance still existed, so they were left with the same performance problem. They came perilously close to having the project completely cancelled, which would have been a financial disaster for the company.
You set great store by tools. But don't companies dislike it when you install foreign tools into their systems?
They don't mind it when I do it because by the time I get in there they're pretty desperate to get things fixed. Most of the time, I get calls on Thursday saying "can you be here by Friday". That's seriously what happens. Under those conditions, the patient is almost dead on the table and you're only there to see if you can revive it. So... they don't worry about foreign tools at that point in time.
So, what are the tools you use then?
Anything I can get my hands on. A lot of commercial and open source tools are out there and I'm happy to use anything that's out there. Some are obviously better than others; some work better in some environments than others. I do like the NetBeans Profiler for finding memory leaks. I can generally find any memory leak within 10 to 15 minutes, with the NetBeans Profiler. As incredible as that sounds, I'm not the only one; it is fairly universal. If you develop a good methodology, you can very quickly identify memory leaks in any application.
I do wish the NetBeans Profiler worked well behind firewalls and in virtualized environments. However, it doesn't, so when I'm in those environments I'll turn to YourKit. It's not as good at memory leaks, but it works brilliantly in hostile environments, that is, behind firewalls, through ssh tunnels, things like that.
This week you met the NetBeans Profiler team in Prague, right?
Right. Some very interesting stuff is coming out in that area, and I'm looking forward to it, but I don't want to steal their thunder. I was able to ask for some things that I hope they'll put in. They were very accommodating and open to experiences from the field. For example, I wanted them to offer me a headless version of the agent, but it doesn't look like that's on the cards anytime soon. But there's some other really cool stuff they'll focus on that'll be very, very helpful.
Another thing is that I don't have enough information currently to identify the dominating consumer via VisualVM. But it's only a small modification that would be needed to let me do that. An indication of system/user breakdown on CPU utilization is what is needed. I asked for that and they're looking into it. They can do that easily with Solaris, though it's more difficult with other systems, because there they don't have access to DTrace. On the other hand, the garbage collection guys give you that kind of information regardless of the system, so I should remind them about that if they don't read it here first! But this really isn't a complaint because those guys are making some really cool tools.
Finally, do you have one single performance tip that readers of this article should put under their pillow so they can sleep on it?
Only the good die young.
Really? What does that mean?
It means that the best way to help your garbage collector is to let your objects die young. Narrow the scope as much as possible. Try to make everything instance based rather than static and, ideally, avoid even instance variables. Though that sounds stupid, if you can get away with only local variables, you're really playing into the garbage collector's strengths. Of course, you're not going to be able to do that everywhere, but the point is that this stresses the right things, since the cost to the garbage collector is dominated by the number of objects that survive.
Therefore, my tip is that if garbage collection is a problem for you, simply try to ensure that objects don't survive.
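The difference between an object that survives and one that dies young comes down to scope, which a small Java sketch can make concrete. The class and method names below are hypothetical. The example only shows reachability: both methods compute the same sum, but one parks its buffer in a static list where it stays reachable, while the other keeps the buffer in a local variable so it becomes garbage as soon as the method returns. It does not measure garbage collection cost.

```java
import java.util.ArrayList;
import java.util.List;

final class DieYoung {

    // Anti-pattern for illustration: a static collection keeps every buffer
    // reachable for the life of the class, so none of them can die young.
    static final List<int[]> RETAINED = new ArrayList<>();

    /** Sums 0..n-1 but leaves its working buffer reachable via RETAINED. */
    static long sumRetaining(int n) {
        int[] buffer = new int[n];
        for (int i = 0; i < n; i++) {
            buffer[i] = i;
        }
        RETAINED.add(buffer); // the buffer now survives this call
        long sum = 0;
        for (int value : buffer) {
            sum += value;
        }
        return sum;
    }

    /** Sums 0..n-1 using only local state; the buffer is unreachable after return. */
    static long sumLocal(int n) {
        int[] buffer = new int[n];
        for (int i = 0; i < n; i++) {
            buffer[i] = i;
        }
        long sum = 0;
        for (int value : buffer) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("local:     " + sumLocal(1000));
        System.out.println("retaining: " + sumRetaining(1000));
        System.out.println("buffers still reachable: " + RETAINED.size());
    }
}
```

Called in a loop, `sumRetaining` grows `RETAINED` without bound, which is the kind of surviving population the tip warns about, whereas `sumLocal` leaves nothing behind for the collector to keep track of.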
Thanks for the tip and the interesting discussion, Kirk!