This week, I was seeing a drop in average back-end performance at work, we had an average drop in page load performance from ~250ms to around 500ms. This seemed to be an intermittent problem and we searched through out graphs at NewRelic with no clear culprit. Then we started looking at our internal MediaWiki profiling collector, and some of the various aggregation tools that I put together. After a few minutes it became clear that the connection time on one of our databases had increased 1000 fold. Having recently changed the spanning-tree configuration, and moving some of our cross connects to 10Gb/s, I suspected this may have been a spanning tree issue. It turns out our Ganglia daemon (gmond) on that host had consumed enough memory due to a memory leak to negative affect system performance. Unfortunately this was a pretty inefficient way to find the problem, and reminded me of a few basic tenets of monitoring.
A simple MySQL test could just tell you whether your server is up or down. Your alerts probably even have timeouts, but in most monitoring tool I’ve seen these are measured in seconds not milliseconds. Your should have your alert configured to tell you when the service you’re monitoring has gone to an unacceptable level, and maybe effecting site performance. So, your simple MySQL check should timeout in 3 seconds, but alert you if its taken more than 100ms to establish a connection. Remember, if the latency is high enough your service is effectively down.
Monitoring your Monitoring
Sometimes your monitoring can get out of whack. You may find that you tests are consuming so many resources that they are negatively effecting your performance. You need to define acceptable parameters for these application, and make sure that its doing what you expect.
Set Your Alerts Lower than You Need
Your alerts should go off before your services are broken. Ideally this would be done with alerts on warnings, but for a good number of people warning are too noisy. If you’re only going to alert on errors, set your threshold well below the service level you expect to provide. For example, if you have an HTTP service that you expect to answer within 100ms, and typically answers within 25ms, your warnings should be set at something like 70ms and errors at 80ms. By alerting early, your preventing a call from a customer, or an angry boss.
So give these three things a try, and you should end up with a better monitoring setup.