
Increasing the Dependability of DevOps Processes

Where do problems arise that cause system outages? What can be done to improve processes to reduce system outages?

by Derek Weeks · May. 29, 18 · Analysis

Software often isn't appreciated until it is unavailable. Constant availability is now taken for granted, but 100% availability is not realistic. That is why outages of high-profile systems, like Netflix or AWS, make national news. Most of us don't work on systems with national user bases, but our users are just as important, so we work hard to reduce system outages.

Where do problems arise that cause system outages? What can be done to improve processes to reduce system outages?

Researchers see that system outages often stem from problems during operations processes, such as upgrading software. Dr. Ingo Weber (@ingomweber) is one of those researchers. He is a principal research scientist and team leader at Data61, part of CSIRO, Australia's government-funded research body. He and his fellow researchers developed an approach and tool framework, Process-Oriented Dependability (POD), to address this challenge in DevOps practices. POD enables fast error detection, root cause analysis, and recovery.

Ingo shared his insights on POD during his talk, Increasing the Dependability of DevOps Processes, describing the approach, tool, and some key findings.

Ingo set the stage by quoting a Gartner study showing that "80% of outages impacting mission-critical services will be caused by people and process issues." The implication is that addressing process issues can significantly reduce system outages.

He also notes that significantly shorter release cycles magnify the potential issues: teams have moved from months between releases with scheduled downtime to continuous delivery, with releases delivered in hours or days. As an example, he notes that Etsy has an average of 25 full deployments/day and 10 commits per deployment. As a result, baseline-based anomaly detection no longer works: cloud uncertainty and continuous changes, such as multiple sporadic operations at all times, scaling in/out, snapshots, migrations, reconfigurations, rolling upgrades, and cron jobs, leave no stable baseline to compare against.

The POD approach at a high level is:

  • Increase dependability during Operation time through:
    • More accurate performance monitoring
    • Faster error detection
    • Fast or autonomous healing (quick fix)
    • Root cause diagnosis to figure out what the actual problem is
    • Guided or autonomous recovery
  • Incorporate change-related knowledge into system management

Digging a little deeper into POD, Ingo talks about two approaches they use: Conformance Checking and Assertion Evaluation.

There are three levels of Conformance Checking:

  • Basic
  • Detecting numerical invariants
  • Detecting timing anomalies

When errors/anomalies are detected, an alert is raised, and all results are visualized through POD-Viz, the dashboard.

Conformance Checking can detect the following types of errors:

  • Unknown/error log line: a log line that corresponds to a known error, or is simply unknown
  • Unfit: a log line corresponds to a known activity, but said activity should not happen in the current execution state of the process instance

All other log lines are deemed fit. The goal is 100% fit; when a line is not fit, an alert is raised, and false alerts are used to improve the classification and/or the model.
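To make the fit/unfit/unknown distinction concrete, here is a minimal Python sketch of conformance checking against a process model. It is not POD's implementation: the rolling-upgrade states, the activity names, the log message patterns, and the helper functions `classify` and `check` are all assumptions made up for illustration.

```python
import re

# Hypothetical process model for one rolling-upgrade step:
# current state -> {allowed activity: next state}
MODEL = {
    "start": {"deregister_instance": "deregistered"},
    "deregistered": {"terminate_instance": "terminated"},
    "terminated": {"launch_instance": "launched"},
    "launched": {"register_instance": "start"},
}

# Hypothetical patterns that map a log line to a known activity.
ACTIVITY_PATTERNS = {
    "deregister_instance": re.compile(r"\bDeregistering instance"),
    "terminate_instance": re.compile(r"\bTerminating instance"),
    "launch_instance": re.compile(r"\bLaunching instance"),
    "register_instance": re.compile(r"\bRegistering instance"),
}

# Hypothetical patterns for log lines that correspond to known errors.
ERROR_PATTERNS = [re.compile(r"\bERROR\b"), re.compile(r"Exception")]


def classify(line, state):
    """Classify one log line and return (verdict, next_state)."""
    if any(p.search(line) for p in ERROR_PATTERNS):
        return "error", state
    for activity, pattern in ACTIVITY_PATTERNS.items():
        if pattern.search(line):
            next_state = MODEL[state].get(activity)
            if next_state is None:
                # Known activity, but not allowed in the current state.
                return "unfit", state
            return "fit", next_state
    # The line matches neither a known error nor a known activity.
    return "unknown", state


def check(lines, state="start"):
    """Replay log lines against the model and collect one verdict per line."""
    verdicts = []
    for line in lines:
        verdict, state = classify(line, state)
        verdicts.append(verdict)
    return verdicts


if __name__ == "__main__":
    log = [
        "Deregistering instance i-1",
        "Launching instance i-2",    # out of order: nothing was terminated yet
        "Terminating instance i-1",
        "ERROR: request timed out",
        "something nobody has seen before",
    ]
    for line, verdict in zip(log, check(log)):
        print(f"{verdict:8} {line}")
```

In this sketch, any verdict other than "fit" would be the trigger for an alert. A real deployment would track one state per process instance rather than a single global state.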

Assertion Evaluation creates and checks against assertions. Assertions check whether the actual state, at a given point, is the expected state. They are coded against cloud APIs so they can find out the true state of resources directly. You also identify the main factors affecting a resource and the log events that have the most influence on changing the state of a system resource. Look at the metrics and choose those that are most relevant, then derive a formula that estimates the value of a variable associated with a system's resource so that you can test it against a range of acceptable values. Then, you derive an assertion based on this.
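As a rough illustration of that idea, the Python sketch below estimates the expected number of running instances from the log events seen so far and compares it with the actual value reported by a cloud API. Everything here is assumed for the example: `FakeCloud` stands in for a real cloud API client, and the event names, the formula, and the tolerance are hypothetical rather than taken from POD.

```python
from dataclasses import dataclass


@dataclass
class FakeCloud:
    """Hypothetical stand-in for a real cloud API client."""
    running_instances: int

    def count_running_instances(self, group):
        # A real client would query the provider for the group's true state.
        return self.running_instances


# Hypothetical effect of each relevant log event on the instance count.
EVENT_DELTA = {"launch_instance": 1, "terminate_instance": -1}


def expected_instance_count(initial, log_events):
    """Estimate the resource value from the events that influence it."""
    return initial + sum(EVENT_DELTA.get(event, 0) for event in log_events)


def evaluate_assertion(cloud, group, initial, log_events, tolerance=0):
    """Return (holds, expected, actual) for the instance-count assertion."""
    expected = expected_instance_count(initial, log_events)
    actual = cloud.count_running_instances(group)
    holds = abs(actual - expected) <= tolerance
    return holds, expected, actual


if __name__ == "__main__":
    events = ["terminate_instance", "launch_instance", "terminate_instance"]
    cloud = FakeCloud(running_instances=2)
    holds, expected, actual = evaluate_assertion(cloud, "web", 4, events)
    print(f"assertion holds: {holds} (expected {expected}, actual {actual})")
```

The `tolerance` parameter plays the role of the range of acceptable values: the assertion holds as long as the actual value stays within that distance of the estimate.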

Does Process-Oriented Dependability sound like something you might want to implement or consider further? Ingo's full talk is especially geared toward practitioners, diving into more detail and examples. It is available to view in its entirety for free here or just check the embedded video below. Additionally, he suggested two papers: Process-Oriented Dependability and Software Performance Engineering in the DevOps World.


Craving more on DevOps practices? Binge-watch any of the 157 practitioner-led sessions, free of charge, at All Day DevOps.

All Day DevOps 2018

All Day DevOps 2018 is just around the corner! Registration is available here.

The free, online conference goes live on October 17th, offering 100 different practitioner-led sessions, each one 30 minutes long. With 5 separate tracks (CI/CD, Cloud-Native Infrastructure, DevSecOps, Cultural Transformations, and Site Reliability Engineering) and 100 speakers, there's sure to be something for everyone.

And speaking of everyone, if you're part of an organization with 20+ people who want to attend the conference (again, it's free!), consider joining the Club 20 program so that you might get your company logo added to the ADDO site. Check out some of the Club 20 participants here and consider joining them.

Hope to see you online at the show!

DevOps Dependability

Published at DZone with permission of Derek Weeks, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
