PagerDuty Machine Learning Capabilities Reduce Outages and Costs
Using data, automation, and machine learning to continually improve performance.
I had the opportunity to meet with Jennifer Tejada, CEO of PagerDuty, following her keynote at the PagerDuty Summit.
During her keynote, Jennifer shared her vision for real-time ops. She sees time as everyone's most valuable resource, and she believes that to succeed as individuals and as companies, we need to work with others to improve work, team outcomes, and our own lives.
During the 10-year history of PagerDuty, we've gone from 100,000 apps to two million. PagerDuty is now processing 10 million events per day. Mean time to acknowledgment is down to less than one minute, and incidents per responder are down from 16 to 4. Speed to market, responsiveness, and lower costs are key drivers for business today.
However, being “always-on” burns people out. 90% of companies use no automation for technical issue resolution and 51% of companies find out about a tech issue from their customers. It's incumbent on businesses to use the data they have to improve employee and customer experience.
Today, it takes companies an average of 80 minutes to coordinate response teams to solve a customer-impacting issue, such as a failed shopping cart or broken web page. The new PagerDuty capabilities described below are expected to help reduce that to five minutes by automating the work of getting the right people together, with the right information, so they can triage issues more quickly when seconds count. By adopting real-time digital operations management practices, large companies can gain upwards of $2.5 million in IT staff productivity savings.
Intelligent Triage is a new feature set within PagerDuty’s Event Intelligence, which uses machine learning (ML) to group alerts so teams don’t receive multiple alerts for related issues. Intelligent Triage also provides additional context about an issue, e.g., whether it has happened before, how it was resolved, how widespread it is, what teams and services are affected, who is working on it, and how they can be reached. This gives teams the knowledge to pull together the right people, with the right information, to solve problems faster, minimizing the cost of downtime and preventing poor customer experiences (CX).
“When there’s a major issue in payment processing, the signal can get lost in all the alerts from related systems,” said Square’s Software Development Manager, Payment Acceptance, Adam Edwards. “Intelligent Triage in Event Intelligence will help us see the impact across related systems in real-time and focus on the most critical issues.”
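To make the idea of alert grouping concrete, here is a minimal sketch in Python. It is not PagerDuty's algorithm, which the article does not describe. It swaps in a simple rule-based heuristic for the ML model: an alert joins an existing group if it arrives within a time window of that group's latest alert and either comes from the same service or has a similar summary. The `Alert` fields, the `group_alerts` function, the five-minute window, and the similarity threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert shape; real event payloads carry many more fields.
    timestamp: float  # seconds since some reference point
    service: str
    summary: str


def tokens(text):
    """Lowercased set of whitespace-separated words in a summary."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_alerts(alerts, window_seconds=300, min_similarity=0.5):
    """Group alerts that are close in time and look related.

    An alert joins the first group whose most recent alert is within
    `window_seconds` and either shares its service or has a summary
    with Jaccard similarity >= `min_similarity`. Otherwise the alert
    starts a new group.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for group in groups:
            last = group[-1]
            recent = alert.timestamp - last.timestamp <= window_seconds
            similar = jaccard(tokens(alert.summary), tokens(last.summary)) >= min_similarity
            if recent and (alert.service == last.service or similar):
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert opens a new one.
            groups.append([alert])
    return groups


if __name__ == "__main__":
    sample = [
        Alert(0, "payments", "card auth timeout"),
        Alert(60, "payments", "db connection refused"),
        Alert(120, "checkout", "db connection refused"),
        Alert(1000, "payments", "card auth timeout"),
    ]
    for i, group in enumerate(group_alerts(sample), start=1):
        print(i, [(a.service, a.summary) for a in group])
```

With the sample data, the first three alerts land in one group and the last alert, which arrives well outside the window, starts a second group. A responder would then be notified once per group rather than once per alert. A production system would presumably learn which alerts belong together from historical data rather than rely on fixed thresholds.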
Intelligent Dashboards, new to PagerDuty’s Analytics product, uses ML to give teams recommendations for resolving issues, along with benchmarks against performance metrics from other teams in their organization or industry so they can improve. The Spotlights recommendation engine draws on 10 years of machine and human response data to give teams context for improvements, such as stopping unactionable alerts and recognizing repeat issues.
“Nearly half of companies experience a major technology issue at least monthly,” said PagerDuty’s SVP Product, Jonathan Rende. “In today’s always-on world, slow responses damage a company’s brand, impact employees and erode the bottom line. Companies urgently need insights into how they are handling these issues so they know how to improve. With Spotlights, we are automating the provision of knowledge that is crucial to both solving problems at the moment and continually improving performance.”
Here's more information about the new capabilities:
Intelligent Triage:
Provides context about an issue, e.g., whether it has happened before, how it was resolved, how widespread it is, what services and teams are affected, who is working on it, and how they can be reached.
Provides automation to ensure teams have the knowledge required to effectively triage issues in real time (e.g., Is this a major incident? Who is needed to help?).
Reduces the impact of unplanned work by giving adjacent teams visibility so they don’t duplicate efforts or interfere with each other.
Creates significant time and cost savings: the majority of tech employees will lose 100-plus hours of productivity due to unplanned work this year.
Intelligent Dashboards:
Leverages 10 years of machine data and human response patterns, applied through Spotlights, PagerDuty’s recommendation engine that learns from past issues to make suggestions that teams can use for future improvements, such as stopping unactionable alerts, fixing repeat issues, and improving escalation practices.
Includes interactive charts and graphs that, unlike static status reports, let customers drill into details by team to show incident volume, response effort, interruption volume, and more.
Provides managers with built-in benchmarks to see how their teams compare to peers in the organization and their vertical industry when it comes to spotting issues, mobilizing teams, and achieving resolutions.
Translates the impact of issues into business outcomes, such as total cost of incidents or response team fatigue, where other solutions offer only basic metrics, such as mean time to respond (MTTR).
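As a rough illustration of the difference between a basic metric and a business outcome, the sketch below computes mean time to acknowledge alongside an estimated total cost of incidents. It is a simplified sketch, not PagerDuty's calculation: the `Incident` fields, the function names, and the flat hourly rate per responder are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical incident record; times are in seconds.
    triggered_at: float
    acknowledged_at: float
    resolved_at: float
    responders: int


def mean_time_to_acknowledge(incidents):
    """Average seconds between trigger and acknowledgment."""
    if not incidents:
        return 0.0
    total = sum(i.acknowledged_at - i.triggered_at for i in incidents)
    return total / len(incidents)


def total_incident_cost(incidents, hourly_rate):
    """Estimated responder cost across all incidents.

    Assumes every responder is engaged from trigger to resolution and
    is billed at the same hourly rate, which is a deliberate
    simplification.
    """
    cost = 0.0
    for i in incidents:
        hours = (i.resolved_at - i.triggered_at) / 3600
        cost += hours * i.responders * hourly_rate
    return cost


if __name__ == "__main__":
    history = [
        Incident(triggered_at=0, acknowledged_at=60, resolved_at=3600, responders=2),
        Incident(triggered_at=0, acknowledged_at=120, resolved_at=7200, responders=1),
    ]
    print("Mean time to acknowledge (s):", mean_time_to_acknowledge(history))
    print("Estimated cost:", total_incident_cost(history, hourly_rate=100))
```

Both figures come from the same incident history, but the second expresses it in terms a business owner can act on. The hourly rate here is a placeholder, and a fuller model would also account for lost revenue and customer impact.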