How Generative AI Is Revolutionizing Cloud Operations
Generative AI is transforming how tech companies approach cloud reliability and operations. In this article, we explore the most compelling applications.
LLMs have made it possible to operate cloud services more effectively and cheaply than ever before. Because they can assimilate both natural language and code, they enable new preventative and remediation tools. Language models are also improving at breakneck speed; as the models get better, services that have integrated them into their operations will reap the benefits for free.
This article explores the most compelling of these applications, many of which are already being deployed at top tech companies.
Code Vulnerability Scanning
Language models digest code more substantively than conventional static analyzers do. They can power scans across a codebase to identify common vulnerabilities such as misconfigured retry logic, lax timeouts, and improper exception handling, and they can suggest code edits to fix each finding.
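As a rough illustration, the sketch below asks a model to review one source file for exactly these classes of defects. It assumes the OpenAI Python SDK; the model name, prompt wording, and file-walking logic are placeholders to adapt, not a prescription.

```python
# Minimal sketch of an LLM-powered reliability scan (illustrative, not a
# production tool). Assumes the OpenAI Python SDK; the model name and
# prompt wording are placeholders you would tune for your codebase.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCAN_PROMPT = (
    "Review the following code for reliability vulnerabilities such as "
    "misconfigured retry logic, lax timeouts, and improper exception "
    "handling. For each finding, cite the line, explain the risk, and "
    "suggest a concrete fix."
)

def scan_file(path: Path) -> str:
    """Ask the model to flag reliability issues in one source file."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable code model works
        messages=[
            {"role": "system", "content": SCAN_PROMPT},
            {"role": "user", "content": path.read_text()},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for source in Path("src").rglob("*.py"):
        print(f"--- {source} ---")
        print(scan_file(source))
```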
This will catch pre-existing vulnerabilities, but it’s also valuable to integrate language models into the code submission tool. Whenever a new code change is proposed, the model will flag any vulnerabilities and suggest edits to the author.
At top tech companies, integrating language models into the code submission process is a major area of investment.
Log Analysis
The root cause of an ongoing incident is often buried in a mountain of irrelevant logs, a needle in a haystack. An LLM-powered search (using retrieval-augmented generation, or RAG) can help on-calls get to the bottom of an issue in seconds instead of hours. The model assesses the logs against the symptoms of the incident and reports the entries that are most likely to be relevant. It can be prompted by the on-call, or even integrated directly with the issue-tracking system so that it auto-posts its log analysis to any new ticket.
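A minimal RAG sketch over logs might look like the following, assuming the OpenAI Python SDK and NumPy. The model names and prompt are illustrative, and the in-memory log list stands in for a real log store (at scale you would index the embeddings in a vector database).

```python
# Sketch of RAG-style log search: embed log lines, retrieve the entries
# most similar to the incident symptoms, and ask the model to analyze them.
# Model names are assumptions; swap in whatever embedding/chat models you use.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return np.array([item.embedding for item in response.data])

def relevant_logs(symptoms: str, log_lines: list[str], k: int = 20) -> list[str]:
    """Return the k log lines most similar to the incident symptoms."""
    log_vecs = embed(log_lines)
    query_vec = embed([symptoms])[0]
    # Cosine similarity; these embeddings come back unit-normalized,
    # but normalize defensively anyway.
    log_vecs /= np.linalg.norm(log_vecs, axis=1, keepdims=True)
    query_vec /= np.linalg.norm(query_vec)
    scores = log_vecs @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [log_lines[i] for i in top]

def analyze(symptoms: str, log_lines: list[str]) -> str:
    """Summarize which retrieved log entries likely explain the incident."""
    candidates = "\n".join(relevant_logs(symptoms, log_lines))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: placeholder model name
        messages=[{
            "role": "user",
            "content": f"Incident symptoms: {symptoms}\n\n"
                       f"Candidate log entries:\n{candidates}\n\n"
                       "Which entries most likely explain the incident, and why?",
        }],
    )
    return response.choices[0].message.content
```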
Another application of log analysis is in change safety. The model can sample logs periodically, and automatically trigger a rollback of any ongoing change if it detects a suspicious new error.
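A change-safety loop built on this idea could be as simple as the following sketch. The deployment hooks are hypothetical callables standing in for your own rollout tooling, and the model name is again a placeholder.

```python
# Sketch of a change-safety monitor: sample logs on a timer and ask the model
# whether the rollout has introduced a suspicious new error. The hooks passed
# in (log fetcher, rollout status, rollback trigger) are hypothetical
# stand-ins for your own deployment infrastructure.
import time
from typing import Callable

from openai import OpenAI

client = OpenAI()

def looks_suspicious(log_sample: str, change_description: str) -> bool:
    """Ask the model for a YES/NO judgment on the sampled logs."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: placeholder model name
        messages=[{
            "role": "user",
            "content": f"A change is rolling out: {change_description}\n\n"
                       f"Recent log sample:\n{log_sample}\n\n"
                       "Do these logs show a suspicious new error plausibly "
                       "caused by the change? Answer YES or NO on the first "
                       "line, then explain briefly.",
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def monitor_rollout(
    change_description: str,
    fetch_recent_logs: Callable[[], str],     # hypothetical log source
    rollout_in_progress: Callable[[], bool],  # hypothetical deploy status
    trigger_rollback: Callable[[], None],     # hypothetical rollback hook
    interval_s: int = 60,
) -> None:
    while rollout_in_progress():
        if looks_suspicious(fetch_recent_logs(), change_description):
            trigger_rollback()
            break
        time.sleep(interval_s)
```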
On-Call Assistance
On-call training is an imprecise and messy process. New on-calls are only exposed to recent issues and rarely have the breadth of systemic understanding needed to handle novel problems. They mostly learn on the fly, which increases risk exposure in addition to overwhelming the on-call.
Language models can pattern-match new issues to older ones and assimilate service documentation quickly. An effective strategy is to fine-tune the model on past issues and on the service's runbooks and documentation. The fine-tuned model can then serve as an assistant that recommends actions for incoming issues and even prepares commands for the on-call to execute.
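As a sketch of what that fine-tuning data might look like, the snippet below converts past incidents into the JSONL chat format that, for example, OpenAI's fine-tuning API accepts. The incident fields, sample data, and system prompt are illustrative assumptions.

```python
# Sketch of preparing fine-tuning data from resolved incidents. Each JSONL
# line pairs incident symptoms with the resolution an assistant should
# recommend. Fields and sample data are illustrative.
import json

past_incidents = [
    {
        "symptoms": "Elevated 5xx rate on the checkout service after deploy",
        "resolution": "Rolled back the release; a stale config flag caused "
                      "retries to hammer the payments backend.",
    },
    # ... more resolved incidents, runbook entries, and documentation
]

with open("oncall_assistant.jsonl", "w") as f:
    for incident in past_incidents:
        record = {
            "messages": [
                {"role": "system",
                 "content": "You are an on-call assistant for this service. "
                            "Recommend concrete next actions for incidents."},
                {"role": "user", "content": incident["symptoms"]},
                {"role": "assistant", "content": incident["resolution"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The resulting file would then be uploaded and referenced when creating the fine-tuning job, after which the tuned model can be queried like any other.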
On-calls spend a lot of time searching for the right procedure or the relevant context on the impacted service; smart assistants accelerate that process dramatically. The assistant can even generate new procedures or runbook entries after an issue is resolved, creating a cycle of self-improvement in incident handling.
Incident Tracking
Complex incidents often last several hours, with multiple engineers and leaders on an incident call. Many of the finer details of how the incident was handled are lost due to imperfect note-taking. Reconstructing this information for the post-mortem takes up valuable engineering bandwidth.
An emerging paradigm is to integrate speech-to-text with the live call and summarize the output with a language model. This creates a detailed breakdown of the incident timeline, improving post-mortem accuracy while also reducing the time spent on timeline reconstruction.
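A simplified version of this pipeline, transcribing a call recording and distilling a timeline, might look like the sketch below. A production system would stream the live call instead of processing a recording after the fact, and the model names are placeholders.

```python
# Sketch: transcribe an incident-call recording and distill a timeline.
# Uses OpenAI's Whisper transcription endpoint as one example of
# speech-to-text; model names are assumptions.
from openai import OpenAI

client = OpenAI()

def incident_timeline(recording_path: str) -> str:
    """Turn a call recording into a summarized incident timeline."""
    with open(recording_path, "rb") as audio:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio
        )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "From this incident-call transcript, produce a timeline of "
                "key events, decisions, and action items:\n\n"
                f"{transcript.text}"
            ),
        }],
    )
    return response.choices[0].message.content
```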
The incident tracker can also update the central bug with any new insights from the live call. For instance, if it is established on the incident call that recovery will take 30 minutes, the system can automatically post this to the bug summary. This improves status visibility to key stakeholders while freeing up engineers to focus on remediating the issue.
Issue Prioritization
It is typical for on-calls to have more bugs than they can handle, so they use their judgment to decide which bugs need their attention. This is an imperfect process: it's not unusual to have an outage and realize afterward that there were early warning signs in a neglected issue.
Language models can scan all the bugs and categorize them as innocuous or concerning, and also explain why a particular bug is important (or not). They can even estimate how much time a particular issue is likely to take based on similar issues in the past.
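A triage pass over the bug queue could be sketched as follows, assuming the OpenAI Python SDK. The two-category scheme and JSON-reply convention are illustrative choices, not the only way to structure the output.

```python
# Sketch of LLM-based triage: classify a bug as innocuous or concerning,
# with a one-sentence justification. The category scheme and model name
# are illustrative assumptions.
import json

from openai import OpenAI

client = OpenAI()

def triage(bug_title: str, bug_description: str) -> dict:
    """Return {"category": ..., "reason": ...} for one bug."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: placeholder model name
        response_format={"type": "json_object"},  # ask for valid JSON
        messages=[{
            "role": "user",
            "content": (
                "Classify this bug for an on-call queue. Reply with JSON "
                'only: {"category": "innocuous" or "concerning", '
                '"reason": "<one sentence>"}\n\n'
                f"Title: {bug_title}\nDescription: {bug_description}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Running this over every open bug yields a ranked worklist, with each classification accompanied by an explanation the on-call can sanity-check.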
Eventually, we will have LLM-powered bots handling straightforward bugs on their own, allowing on-calls to focus on the more complex issues.
Conclusion
To summarize, there is plenty of low-hanging fruit for optimizing cloud operations in the ongoing AI revolution:
- Prevent issues before they occur through code analysis for reliability errors
- Detect issues and anomalies rapidly through intelligent log analysis
- Boost on-call issue handling through smart AI assistants
- Track complex incidents with AI
- Triage and prioritize issues with AI so that on-calls are focused on the most important issues
With recent advances in LLMs and AI in general, there are abundant opportunities across the stack for improving operational efficiency and resilience. New companies, especially ones building AI-based products, should be on the lookout for such opportunities. There are a lot of synergies between leveraging AI to deliver value to customers and leveraging it to improve the operations of the product itself.