I wore a “Why Be Normal?” button with pride throughout college. And I did it out of choice rather than as a piece of required flair from the infamous “Office Space” restaurant, Chotchkie’s.
In the world of database performance, “normal” is a highly prized goal. It means nothing extraordinary is happening that requires intervention or diagnosis. It means no fire-fighting or frantic finger-pointing exercises across functions in the IT department. But to be measured and monitored, “normal” must first be defined — a task that is sometimes easier said than done.
If you’ve ever attended a best practices presentation for DBAs, you know the first step in measuring and monitoring database performance is to establish baselines. The problem is that people throw that term around while often meaning very different things.
Defining Baselines and Anomalies
At its most basic definition, a baseline is any starting measurement taken to evaluate future changes against the same. It doesn’t have to be representative or taken over time. But it should be taken at an appropriate time, and exclude potential “outliers” that might throw your data off — thus establishing a “normal” operating range for your system.
All database performance monitoring tools aim to detect anomalies, but this word means different things to different people. The most basic monitoring tools detect deviation from default best practice settings. The challenge is finding the right balance between false negatives and the danger of missed problems versus false positives and the overhead of managing too many overly cautious alerts.
Experienced DBAs often develop their own best practice settings for alerts based on:
Personal tolerance for risk of missing a real problem.
Knowledge of different application behavior and instance performance.
Criticality of specific applications or systems to the business itself.
The Art and Science of Alert Tuning and Filtering
Tuning alert thresholds can be more of an art than a science. Even experienced DBAs may not be familiar with every instance under their care to know the right balance to strike immediately. In this case, they must watch performance over time and manipulate thresholds gradually, often establishing a template of thresholds for a particular set of instances based on different combinations of the three criteria above.
When manipulating simple thresholds is insufficient, DBAs may resort to filtering or alert suppression to filter out temporary spikes:
Some monitoring tools permit filtering of alerts on particular databases, files, or disks that may be deemed unimportant for one reason or another. Ideally, DBAs want to auto-calibrate threshold tuning as much as possible to minimize the need for extensive experience with every application and instance, and to reduce the risks of trial-and-error processes.
“What Is Normal?” vs. “What Is Extreme?”
Many monitoring products claim to automate baseline measurement, but this often means different things in different tools. Most will plot a reference line or a range of values and leave it to the DBA to assign meaning to that range, which may or may not represent what is normal for a particular database instance.
Using minimums and maximums is the easiest way to determine a range of values for comparison to a current value. However, using outlier values as an approximation of normal behavior can be problematic and misleading.
In a time-series metric, it is possible to correct for trends by using a moving window of time for the calculation population and calculating an accepted range based on the mean and standard deviation. Some tools can also correct for cyclicality by using different calculations for times when they observe different performance because of changing the load on the database or system.
What Does “Default” Baseline Mean Anyway?
Traditional monitoring tools only provide one performance baseline calculation. With more modern tools, that calculation is named the default while the user can add as many baseline calculations as desired. The default calculation period can also overlap with any other baseline period.
The “default” baseline can be used in several ways:
To calculate alert recommendations.
When no other baseline is established.
As a band on the threshold graphic.
You can either use a dynamic baseline or specify a particular time period if you want more review control before changing alert thresholds or to preserve a particular high point of activity for threshold settings.
One way to use baselines effectively is to periodically, but manually, adjust alert thresholds after reviewing recommended settings. The recommended changes will always be based on some set percentage of the baseline, which is always set as the mean plus one standard deviation.
Only the alerts that normally exceed current thresholds will generate alert recommendations. Users can accept all recommendations or choose certain ones to change.
The default baseline and alert recommendations are a good way to tune static thresholds based on measured performance of particular databases or systems.
DBAs who take the time to accurately define baselines will lay the groundwork for optimal database performance. But baseline measurement can be tricky. By understanding the nuances of alert tuning and filtering while using the latest performance management tools, DBAs will ensure their databases operate in the most wonderful way possible: normally.