Solutions for spotting those pesky outliers have been around in credit card processing, banking, and security for some time now. “Anomaly detection” (often called fraud detection there) is a pretty common term in those industries. The term has now made its way to the dynamic data center, where system anomalies can be characterized as either static or dynamic. Well, at least that’s how we’ve categorized them.
Static anomalies happen when there are state inconsistencies between servers in a cluster (physical or virtual), or sometimes within a single server. You may start out with identical resources, such as OS versions, software packages, or data, but over time drift happens, even when systems are provisioned identically with automated tools. It’s hard to guarantee uniformity over time, or even immediately after a release or provisioning run, because:
- Machines that time out during provisioning can miss certain installations
- Virtual machines created during or after provisioning can miss installations
- Software downloads from online repositories may pull differently versioned files onto different servers
- Human error and failure to follow protocol lead to ad hoc fixes or changes that don’t get logged or deployed properly
- Security breaches or exploits can interfere with the base provisioning
In effect, static anomalies live on your servers, in the state of your environment (which is why we sometimes also call them environment anomalies).
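To make the idea of environment drift a bit more concrete, here is a minimal sketch of one way to spot static anomalies: treat the most common package version across a cluster as the expected state and flag any server that deviates from it. This is an illustration under stated assumptions, not how our product works. The function name `find_static_anomalies`, the inventory format (server name mapped to package versions), and the data are all hypothetical; in practice the inventory might come from a configuration-management tool or a package-manager query.

```python
from collections import Counter
from typing import Dict, Optional


def find_static_anomalies(
    inventory: Dict[str, Dict[str, str]],
) -> Dict[str, Dict[str, Optional[str]]]:
    """Flag packages whose version on a server differs from the cluster majority.

    Hypothetical helper. `inventory` maps server name -> {package name: installed version}.
    Returns server name -> {package name: that server's version, or None if missing}.
    """
    # Every package seen anywhere in the cluster.
    all_packages = set()
    for packages in inventory.values():
        all_packages.update(packages)

    anomalies: Dict[str, Dict[str, Optional[str]]] = {}
    for pkg in sorted(all_packages):
        # Each server's version of this package; None means "not installed".
        versions = {server: pkgs.get(pkg) for server, pkgs in inventory.items()}
        # Assumption: the most common version (or absence) is the expected state.
        majority, _ = Counter(versions.values()).most_common(1)[0]
        for server, version in versions.items():
            if version != majority:
                anomalies.setdefault(server, {})[pkg] = version
    return anomalies


# Made-up inventory for illustration only.
inventory = {
    "web-1": {"openssl": "3.0.2", "nginx": "1.24.0"},
    "web-2": {"openssl": "3.0.2", "nginx": "1.24.0"},
    "web-3": {"openssl": "3.0.1", "nginx": "1.24.0"},  # stale package
    "web-4": {"openssl": "3.0.2"},                      # missed an install
}
print(find_static_anomalies(inventory))
# {'web-4': {'nginx': None}, 'web-3': {'openssl': '3.0.1'}}
```

A majority vote only makes sense with more than two servers; with an even split, `Counter.most_common` simply picks whichever version it saw first, so a real check would want an explicit baseline (for example, the state recorded at release time) instead.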
Dynamic anomalies, on the other hand, show up as inconsistent server behavior, and they can materialize even without any environment drift. A server could behave unlike the others, or even unlike itself the day before. All it takes is:
- Bad code
- Network congestion
- A security breach
- Hardware hiccups
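The two flavors of dynamic anomaly (a server unlike its peers, and a server unlike its own past) can be sketched with a simple standard-deviation test on a single metric. This is a minimal illustration, not a production detector and not our actual method: the function names, the choice of metric (p95 latency), the threshold `k = 3`, and the numbers are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def is_outlier(value: float, reference: Sequence[float], k: float = 3.0) -> bool:
    """True if `value` lies more than k sample standard deviations from the reference mean."""
    if len(reference) < 2:
        raise ValueError("need at least two reference samples")
    mu = mean(reference)
    sigma = stdev(reference)
    if sigma == 0:
        # A perfectly flat reference: any difference at all is unusual.
        return value != mu
    return abs(value - mu) > k * sigma


def peer_outliers(current: Dict[str, float], k: float = 3.0) -> List[str]:
    """Servers behaving unlike the others (hypothetical helper).

    Each server is compared with the rest of the cluster, excluding itself,
    so an outlier cannot inflate its own baseline.
    """
    flagged = []
    for server, value in current.items():
        others = [v for s, v in current.items() if s != server]
        if is_outlier(value, others, k):
            flagged.append(server)
    return flagged


def self_outlier(current: float, history: Sequence[float], k: float = 3.0) -> bool:
    """A server behaving unlike itself: compare now with its own recent history."""
    return is_outlier(current, history, k)


# Made-up p95 latencies in milliseconds, for illustration only.
latency_ms = {"web-1": 120.0, "web-2": 118.0, "web-3": 125.0, "web-4": 122.0, "web-5": 410.0}
print(peer_outliers(latency_ms))  # ['web-5']

# Made-up history of web-5's daily p95 latency before today.
history = [121.0, 119.0, 124.0, 120.0, 123.0, 118.0, 122.0]
print(self_outlier(410.0, history))  # True
```

Comparing each server against the others rather than against a cluster-wide average matters with small clusters: if the outlier is included in the baseline, it widens the standard deviation enough that it may never cross the threshold.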
Those are the two main buckets we use for categorizing anomalies. But we’re always tossing around different ideas. How do you see the world of system anomalies? What kind of buckets would you use? Come to think of it, are there more than two buckets? We’d love to hear how other ops folks think about this. Leave a comment or tweet us, @metaforsoftware.