Towards Auto-Tuning Systems

Read about how embedded machine learning could create auto-tuning distributed systems that can test and monitor their own performance metrics.

The idea of distributing risk is fundamental to distributed systems and cloud computing. Losing a single node in a 2-node system takes out 50% of its capacity, whereas losing the same fraction of a 20-node cluster requires 10 nodes to crash at the same time, and the probability of that happening is very remote (though it cannot be completely ruled out). Hence scale-out architectures, in which the risk is distributed and incremental failures are handled much more gracefully without sacrificing the customer experience, have gained momentum over scale-up systems.
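To make the comparison concrete, here is a minimal sketch that assumes each node fails independently with the same probability `p` during some time window. Both the independence assumption and the value `p = 0.05` are illustrative and not taken from any measurement; correlated failures (shared power, network, or a bad deploy) would make large simultaneous outages more likely than this model suggests.

```python
from math import comb

def prob_at_least_k_failures(n: int, k: int, p: float) -> float:
    """P(at least k of n nodes fail), assuming independent failures with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.05  # assumed per-node failure probability in some time window (illustrative)

# Losing half the capacity of a 2-node system needs only one failure.
half_of_2 = prob_at_least_k_failures(2, 1, p)
# Losing half the capacity of a 20-node cluster needs ten simultaneous failures.
half_of_20 = prob_at_least_k_failures(20, 10, p)

print(f"2 nodes, at least 1 down:   {half_of_2:.4f}")
print(f"20 nodes, at least 10 down: {half_of_20:.2e}")
```

Under these assumptions the second probability is several orders of magnitude smaller than the first, which is the intuition behind scale-out designs.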

When you have many nodes, manageability and monitoring, in terms of both hardware and software, come to the fore. Monitoring as traditionally practiced has a fundamental flaw: it assumes that the metrics you chose to monitor are the ones that will cause failure or help you diagnose it. In a software system that has a few thousand metrics, monitoring all of them and finding the correlations between them is not a trivial problem, and certainly not one a human can solve unaided. This is where machines can help.

Machines are fundamentally very good at number-crunching, and correlating metrics is a hard problem that suits them. For example, say your software crashes. One of the first things you might look at is the system’s CPU and memory usage, but the number of open file handles is probably not among the first 10, or even the first 20, metrics you will check. Existing logging techniques also lead to a reactive workflow. Ding Yuan et al. [1] found that “Logs are noisy: the median of the number of log messages printed by each failure is 824.” In a distributed system, the problem is compounded many times over. In a distributed database, if a record is not replicated, then tracing the flow across nodes and diving deep into the stack is not a trivial problem, and certainly not one that your database administrators will be keen on doing. Software should just work!
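As a rough illustration of letting the machine do the correlating, the sketch below ranks metrics by the strength of their Pearson correlation with a crash indicator. Pearson correlation is just one simple choice made for this example, the metric names and sample values are made up, and a strong correlation only flags a metric for inspection; it does not establish a cause.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def rank_metrics(metrics, failures):
    """Sort metric names by |correlation| with the failure indicator, strongest first."""
    scores = {name: pearson(series, failures) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Made-up samples, one value per time window (illustrative only).
metrics = {
    "cpu_percent":       [40, 42, 41, 43, 40, 42, 41, 43],
    "memory_percent":    [60, 61, 60, 62, 61, 60, 62, 61],
    "open_file_handles": [100, 120, 150, 400, 900, 110, 130, 950],
}
failures = [0, 0, 0, 0, 1, 0, 0, 1]  # 1 = a crash was observed in that window

for name, score in rank_metrics(metrics, failures):
    print(f"{name:18s} {score:+.2f}")
```

With this toy data the file-handle count comes out on top, even though it is not a metric an operator would usually check first. With a few thousand real metrics the same ranking costs the machine nothing extra.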

A reactive approach to managing failures, though it has been the modus operandi for a long time, is not sustainable when the number of nodes or systems to be monitored is too large. The standard complaint from users of your software in such cases will be about the quality of the code and the plethora of bugs that manifest when the system goes to production. This suggests top-down testing, in which the software is tested with all possible inputs in its input space, its behavior is observed, and course corrections are applied before the software makes its way to the customer. Ding Yuan et al. point out that Hadoop has its own error/fault injection framework for testing the system by introducing artificial faults, but also note that "the production failures we studied are likely the ones missed by such tools," which calls the very essence of testing into question: any case that is missed, edge case or not, leads to a defective product. The authors also point out that "in a study on field failures with IBM’s MVS operating system between 1986 and 1989, Sullivan et al. found that incorrect error recovery was the cause of 21% of the failures and 36% of the failures with high impact [3]. In comparison, we find that in the distributed systems we studied, incorrect error handling resulted in 25% of the non-catastrophic failures, and 92% of the catastrophic ones."

Wrapping an exception handler around the code and repeating until the operation succeeds, along the lines of what Waldo et al. [2] point out, is feasible, but it is not deterministic and is an inefficient computing paradigm:

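while (true) {
  try { operation(); break; }                 // operation() is a placeholder for the remote call
  catch (NotFoundInContext) { break; }        // nothing left to do
  catch (NetworkServerFailure) { continue; }  // transient fault: try again
}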

The exception-handler paradigm also limits the behavior that can be accomplished: what happens if the exception handler itself fails? Essentially, no transactions are ever put in exception handlers, and systems generally roll back. In many cases, exceptions are simply swallowed. Ding Yuan et al. make the following two observations with regard to handling such failures:

  • Almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.
  • In 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code.
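The second observation can be illustrated with a very small fault-injection test. This is only a sketch: `replicate`, `make_flaky_send`, and the `NetworkServerFailure` class are hypothetical names invented for the example, and the point is simply that the error-handling path is exercised on purpose instead of being discovered in production.

```python
class NetworkServerFailure(Exception):
    """Hypothetical non-fatal error signaled by a remote call."""

def replicate(record, send, max_attempts=3):
    """Try to send a record to a replica; report failure instead of swallowing it."""
    for _ in range(max_attempts):
        try:
            send(record)
            return True
        except NetworkServerFailure:
            continue  # transient fault: retry
    return False      # caller must deal with the unreplicated record

def make_flaky_send(failures_before_success):
    """Test double that raises NetworkServerFailure a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success, "sent": []}
    def send(record):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise NetworkServerFailure("injected fault")
        state["sent"].append(record)
    return send, state

# The handler recovers from two injected faults...
send, state = make_flaky_send(2)
assert replicate("r1", send) is True and state["sent"] == ["r1"]

# ...and reports, rather than hides, a fault that never clears.
send, state = make_flaky_send(10)
assert replicate("r1", send) is False and state["sent"] == []
```

The retry limit of three is an arbitrary choice for the sketch. What matters is that both outcomes of the handler are covered by a test.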

They also make the significant observation that "a majority (77%) of the failures require more than one input event to manifest, but most of the failures (90%) require no more than 3." Whether a failure occurs is governed not by a single event alone but by the other events that accompany it. This leads to a testing strategy in which test cases operate over sequences of events rather than on specific, pre-identified scenarios.
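One way to read that strategy is as an exhaustive search over short event sequences. The sketch below is a toy under stated assumptions: the event names and the `ToyReplica` class, including its deliberately planted bug, are invented for illustration. It tries every sequence of up to three events and reports the ones that lose data.

```python
from itertools import product

EVENTS = ["write", "node_restart", "network_partition", "read"]  # hypothetical event names

class ToyReplica:
    """Toy system under test with a planted bug: a write accepted during a
    partition is lost if the node restarts afterwards."""
    def __init__(self):
        self.partitioned = False
        self.pending = 0   # writes not yet replicated
        self.lost = 0      # writes dropped by the bug

    def apply(self, event):
        if event == "network_partition":
            self.partitioned = True
        elif event == "write":
            if self.partitioned:
                self.pending += 1
        elif event == "node_restart":
            self.lost += self.pending   # bug: pending writes are not persisted
            self.pending = 0
        # "read" has no effect on state in this toy model

def failing_sequences(max_len=3):
    """Run every event sequence of length 1..max_len and collect those that lose data."""
    failures = []
    for length in range(1, max_len + 1):
        for seq in product(EVENTS, repeat=length):
            system = ToyReplica()
            for event in seq:
                system.apply(event)
            if system.lost > 0:
                failures.append(seq)
    return failures

bad = failing_sequences()
print("failing sequences:", bad)
```

No single event or pair of events triggers the planted bug; only one ordering of three events does. With four event types there are just 4 + 16 + 64 = 84 sequences to try, so the search is cheap at this depth.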

Software, once shipped, needs to carry many of these testing strategies embedded within it, so that it can recognize such a sequence of events as it unfolds and auto-correct the relevant parameters. Hidden Markov Models are a useful tool here: they can estimate the probability of failure for a class of event sequences and trigger corrective actions without bringing the software down. A batch machine learning system that routinely consumes the logs and checks for patterns is also a possibility. Together these suggest a Lambda Architecture, in which the speed layer monitors the system in real time while the batch layer operates over a time window.
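A minimal sketch of the Hidden Markov Model idea follows. It uses the standard forward (filtering) recursion over two hidden states to estimate how likely the system is to be degraded given the log-level events seen so far. The states, the observation symbols, every probability, and the `should_intervene` threshold policy are assumptions made up for this example; a real system would have to learn them from its own logs.

```python
STATES = ("healthy", "degraded")

# All probabilities below are made-up illustrative values, not fitted to real logs.
START = {"healthy": 0.9, "degraded": 0.1}
TRANS = {
    "healthy":  {"healthy": 0.9, "degraded": 0.1},
    "degraded": {"healthy": 0.2, "degraded": 0.8},
}
EMIT = {
    "healthy":  {"ok": 0.8, "warn": 0.15, "error": 0.05},
    "degraded": {"ok": 0.2, "warn": 0.4,  "error": 0.4},
}

def normalize(weights):
    """Scale a dict of non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def filter_states(observations):
    """Forward algorithm with per-step normalization.
    Returns P(state | observations so far) after the last observation."""
    belief = normalize({s: START[s] * EMIT[s][observations[0]] for s in STATES})
    for obs in observations[1:]:
        # Predict: push the current belief through the transition model.
        predicted = {s: sum(belief[prev] * TRANS[prev][s] for prev in STATES)
                     for s in STATES}
        # Update: weight by how well each state explains the new observation.
        belief = normalize({s: predicted[s] * EMIT[s][obs] for s in STATES})
    return belief

def should_intervene(observations, threshold=0.8):
    """Hypothetical policy: act when the degraded state is likely enough."""
    return filter_states(observations)["degraded"] >= threshold

print(filter_states(["ok", "ok", "ok"]))
print(should_intervene(["warn", "error", "error"]))
```

In Lambda Architecture terms, a recursion like this is cheap enough to run in the speed layer on each incoming event, while the batch layer could periodically re-estimate the transition and emission tables from a longer window of logs.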


[1] “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems” by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14), 2014.

[2] “A Note on Distributed Computing” by Jim Waldo, Geoff Wyant, Ann Wollrath and Sam

[3] M. Sullivan and R. Chillarege. Software defects and their impact on system availability - A study of field failures in operating systems. In Twenty-First International Symposium on Fault-Tolerant Computing, FTCS’91, pages 2–9, 1991.
