How Generative AI Is Revolutionizing Cloud Operations

Generative AI is transforming how tech companies approach cloud reliability and operations. In this article, we explore the most compelling applications.

By Aditya Visweswaran · Feb. 25, 25 · Analysis

LLMs have made it possible to operate cloud services more effectively and cheaply than ever before. Because they can ingest both natural language and code, they enable new preventive and remedial tooling. Language models are also improving at breakneck velocity: as the models get better, services that have integrated them into their operations will reap the benefits for free.

We explore the most compelling applications in this article, many of which are already being deployed at top tech companies. 

Code Vulnerability Scanning

Language models digest code in a more substantive way than conventional analyzers. They can power scans across a codebase to identify common vulnerabilities such as misconfigured retry logic, lax timeouts, and improper exception handling. The model can also suggest code edits to fix the vulnerability. 

This will catch pre-existing vulnerabilities, but it’s also valuable to integrate language models into the code submission tool. Whenever a new code change is proposed, the model will flag any vulnerabilities and suggest edits to the author. 

At top tech companies, integrating language models into the code submission process is a major area of investment. 
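
As a sketch of how such a submission hook might look, the snippet below builds a reliability-review prompt for a proposed diff and parses the model's reply into structured findings. The prompt wording, the expected JSON reply format, and the `call_llm` function are all assumptions; substitute your provider's client.

```python
import json

# Hypothetical sketch of an LLM reliability review wired into a code
# submission hook. `call_llm` is any prompt -> text function (for example,
# a thin wrapper around your provider's SDK).

REVIEW_PROMPT = """You are a reliability reviewer. Scan the diff below for \
issues such as misconfigured retry logic, missing or lax timeouts, and \
swallowed exceptions. Reply with a JSON list of findings, each an object \
with keys "file", "line", "issue", and "suggested_fix".

Diff:
{diff}
"""

def build_review_prompt(diff: str) -> str:
    return REVIEW_PROMPT.format(diff=diff)

def parse_findings(model_output: str) -> list[dict]:
    """Extract the JSON list from the reply, tolerating surrounding prose."""
    start = model_output.find("[")
    end = model_output.rfind("]") + 1
    if start == -1 or end == 0:
        return []
    try:
        return json.loads(model_output[start:end])
    except json.JSONDecodeError:
        return []

def review_change(diff: str, call_llm) -> list[dict]:
    """Flag reliability issues in a proposed change and suggest fixes."""
    return parse_findings(call_llm(build_review_prompt(diff)))
```

A CI integration would call `review_change` on each proposed diff and post the findings back to the author as review comments.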

Log Analysis

The root cause of an ongoing incident is often buried away in a mountain of irrelevant logs, a needle in a haystack. An LLM-powered search (using RAG) can help on-calls get to the bottom of an issue in seconds instead of hours. The model will assess the logs against the symptoms of the incident, and report the entries that are most likely to be relevant. The model can be prompted by the on-call, or even directly integrated with the issue tracking system, such that it auto-posts its log analysis to any new ticket. 
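
A minimal sketch of the retrieval half of such a search is below. A bag-of-words cosine similarity stands in for a real embedding model and vector store, purely to keep the example self-contained; the retrieved candidates would then be handed to the model, along with the symptom description, for assessment.

```python
import math
from collections import Counter

# Toy sketch of the retrieval step in an LLM-powered log search. A
# bag-of-words cosine similarity stands in for a real embedding model
# and vector store so the shape of the pipeline is visible.

def _vector(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_logs(symptoms: str, log_lines: list[str], k: int = 5) -> list[str]:
    """Return the k log lines most similar to the incident symptoms."""
    sv = _vector(symptoms)
    ranked = sorted(log_lines,
                    key=lambda line: _cosine(sv, _vector(line)),
                    reverse=True)
    return ranked[:k]
```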

Another application of log analysis is in change safety. The model can sample logs periodically, and automatically trigger a rollback of any ongoing change if it detects a suspicious new error.
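
As an illustration, the sketch below implements the rollback trigger with a simple novelty filter: any error signature seen during the rollout but absent from the pre-change baseline is treated as suspicious. In a real deployment the model would classify the novel errors before acting; `trigger_rollback` is a placeholder for your deployment system's API.

```python
# Sketch of the change-safety monitor: logs sampled during a rollout are
# compared against a pre-change baseline, and any novel error signature
# triggers a rollback. A production system would have the LLM vet the
# novel errors first; `trigger_rollback` is a placeholder hook.

def new_error_signatures(baseline: set[str], sampled_logs: list[str]) -> set[str]:
    """Error lines seen during the rollout but absent from the baseline."""
    current = {line for line in sampled_logs if "ERROR" in line}
    return current - baseline

def check_rollout(baseline: set[str], sampled_logs: list[str],
                  trigger_rollback) -> bool:
    """Roll back if the rollout introduced unfamiliar errors."""
    novel = new_error_signatures(baseline, sampled_logs)
    if novel:
        trigger_rollback(reason=f"{len(novel)} new error signature(s) detected")
        return True
    return False
```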

On-Call Assistance

On-call training is an imprecise and messy process. New on-calls are only exposed to recent issues and rarely have the breadth of systemic understanding needed to handle novel problems. They mostly learn on the fly, which increases risk exposure in addition to overwhelming the on-call. 

Language models can pattern-match new issues to older ones, and assimilate service documentation quickly. An effective strategy is to fine-tune the model on past issues, and the service’s runbooks and documentation. The fine-tuned model can be used as an assistant to recommend actions for any incoming issues and even prepare commands for the on-call to execute.

On-calls spend a lot of time searching for the right procedure or the relevant context on the impacted service; smart assistants accelerate that process dramatically. The assistant can even generate new procedures or runbook entries after an issue is resolved, creating a cycle of self-improvement in incident handling. 
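
One concrete piece of that fine-tuning strategy is preparing the training data. The sketch below maps resolved incidents into chat-format JSONL records of the kind several providers accept for supervised fine-tuning; the incident field names and the system prompt are illustrative assumptions.

```python
import json

# Hedged sketch: turning a history of resolved incidents into chat-format
# JSONL fine-tuning records. The incident field names ("symptoms",
# "root_cause", "resolution") and the system prompt are assumptions.

SYSTEM_PROMPT = "You are an on-call assistant for this service."

def to_training_example(incident: dict) -> str:
    """Map one resolved incident to a single JSONL fine-tuning record."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Symptoms: {incident['symptoms']}"},
            {"role": "assistant",
             "content": (f"Likely cause: {incident['root_cause']}. "
                         f"Recommended action: {incident['resolution']}")},
        ]
    }
    return json.dumps(record)

def build_dataset(incidents: list[dict]) -> str:
    """One JSON record per line, ready to upload as a fine-tuning file."""
    return "\n".join(to_training_example(i) for i in incidents)
```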

Incident Tracking

Complex incidents often last several hours, with multiple engineers and leaders on an incident call. Many of the finer details of how the incident was handled are lost due to imperfect note-taking. Reconstructing this information for the post-mortem takes up valuable engineering bandwidth. 

An emerging paradigm is to integrate speech-to-text with the live call and summarize the output with a language model. This creates a detailed breakdown of the incident timeline, improving post-mortem accuracy while also reducing the time spent on timeline reconstruction. 

The incident tracker can also update the central bug with any new insights from the live call. For instance, if it is established on the incident call that recovery will take 30 minutes, the system can automatically post this to the bug summary. This improves status visibility to key stakeholders while freeing up engineers to focus on remediating the issue. 
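
The core loop of such a tracker can be sketched as follows: each transcript chunk from the speech-to-text stream is summarized, and anything the model flags as a status update is posted to the incident bug. The "STATUS:" reply convention is an assumption encoded in the prompt, and `call_llm` and `post_to_bug` are placeholders for your model client and issue tracker API.

```python
# Sketch of the live-call tracker loop. Transcript chunks arrive from the
# speech-to-text stream; each is summarized by a language model, and any
# reply flagged as a status update is posted to the incident bug.

SUMMARY_PROMPT = (
    "Summarize this incident-call transcript chunk in one or two sentences. "
    "If it contains a status update relevant to stakeholders (for example, "
    "a recovery ETA or a mitigation step), begin your reply with 'STATUS: '."
    "\n\nTranscript:\n{chunk}"
)

def process_chunk(chunk: str, call_llm, post_to_bug) -> str:
    """Summarize one transcript chunk, posting status updates to the bug."""
    summary = call_llm(SUMMARY_PROMPT.format(chunk=chunk))
    if summary.startswith("STATUS: "):
        post_to_bug(summary.removeprefix("STATUS: "))
    return summary
```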

Issue Prioritization

It is typical for on-calls to have more bugs than they can handle. They use their judgment to identify which bugs require their attention. This is an imperfect process — it’s not unusual to have an outage, and realize afterward that there were early warning signs in a neglected issue. 

Language models can scan all the bugs and categorize them as innocuous or concerning, and also explain why a particular bug is important (or not). They can even estimate how much time a particular issue is likely to take based on similar issues in the past. 

Eventually, we will have LLM-powered bots handling straightforward bugs on their own, allowing on-calls to focus on the more complex issues.
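
A hypothetical triage pass over the bug queue might look like the following. The "LABEL | reason" reply format is an assumption encoded in the prompt, `call_llm` is a placeholder for your model client, and anything the model returns in an unexpected shape is deliberately escalated rather than dropped.

```python
# Sketch of LLM bug triage: each open bug is classified as concerning or
# innocuous with a one-line rationale, and concerning bugs are surfaced
# first. The reply format is an assumption baked into the prompt.

TRIAGE_PROMPT = (
    "Classify this bug report for an on-call engineer. Reply with exactly "
    "one line: CONCERNING or INNOCUOUS, then ' | ', then a one-sentence "
    "reason.\n\nBug report:\n{report}"
)

def triage(report: str, call_llm) -> tuple[str, str]:
    """Return (label, reason) for one bug report."""
    reply = call_llm(TRIAGE_PROMPT.format(report=report))
    label, _, reason = reply.partition(" | ")
    label = label.strip().upper()
    if label not in {"CONCERNING", "INNOCUOUS"}:
        # Fail safe: surface anything unparseable instead of dropping it.
        label, reason = "CONCERNING", reply
    return label, reason.strip()

def sort_queue(reports: list[str], call_llm) -> list[tuple[str, str, str]]:
    """Triage every open bug and put the concerning ones first."""
    results = [(r, *triage(r, call_llm)) for r in reports]
    return sorted(results, key=lambda item: item[1] != "CONCERNING")
```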

Conclusion

To summarize, the ongoing AI revolution offers plenty of low-hanging fruit for optimizing cloud operations:

  • Prevent issues before they occur through code analysis for reliability errors
  • Detect issues and anomalies rapidly through intelligent log analysis
  • Boost on-call issue handling through smart AI assistants
  • Track complex incidents with AI 
  • Triage and prioritize issues with AI so that on-calls are focused on the most important issues

With recent advances in LLMs and AI in general, there are abundant opportunities across the stack for improving operational efficiency and resilience. New companies, especially ones building AI-based products, should be on the lookout for such opportunities. There are a lot of synergies between leveraging AI to deliver value to customers and leveraging it to improve the operations of the product itself.


Opinions expressed by DZone contributors are their own.
