AIOps Now: Scaling Kubernetes With AI and Machine Learning

Using AI and digital twins, optimize Kubernetes apps and address SRE challenges with continuous learning for improved outcomes.

By Raj Nair · Feb. 04, 24 · Analysis

If you are a site reliability engineer (SRE) for a large Kubernetes-powered application, optimizing resources and performance is a daunting job. Some spikes, like a busy shopping day, can be broadly scheduled, but doing that well requires painstakingly understanding the behavior of hundreds of microservices and their interdependencies, and re-evaluating that understanding with each new release. That is not a scalable approach, to say nothing of the monotony and stress it imposes on the SRE. Moreover, there will always be unexpected peaks to respond to. Continually keeping tabs on performance and putting the optimal amount of resources in the right place is essentially impossible.

Today, this is solved through gross overprovisioning, or through a combination of guesswork and endless alerts that require support teams to review and intervene. It is simply not sustainable or practical, and certainly not scalable. But it is exactly the kind of problem that machine learning and AI thrive on. We have spent the last decade dealing with such problems, and the arrival of the latest generation of AI tools, such as generative AI, has opened the possibility of applying machine learning to the real problems of the SRE and realizing the promise of AIOps.

Turning Up the Compute Knob…to Be Safe

No matter how good your observability dashboard is, the volume of data and the speed at which you must react are simply too much for a human. You have to provision adequate resources to achieve the desired response times and error rates. It is not unusual for people in this role to peg compute utilization at 30 percent "to be safe" and then stand ready to monitor hundreds of microservices to ensure the desired service-level agreement (SLA) is met. The end result is costly, not just in compute resources but also in the DevOps resources dedicated to maintaining the SLA.

It seems that, for all it has brought us, Kubernetes has gone beyond the comprehension of those charged with operating it. Horizontal pod autoscaling (HPA) and reactive scaling solutions still leave SREs guessing at what CPU utilization threshold will work across varying traffic loads and service graph dependencies. Traffic does not have a linear relationship to microservice load, and thus to performance, and load is not the only reason to change the state of the application deployment: SREs also monitor issues like temperature, faults, and latency.
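
To make the non-linearity concrete, here is a minimal sketch in Python (not any particular product's logic; the queueing curve and every number are assumptions for illustration). At a 70 percent CPU target this toy service sits comfortably inside a 100 ms SLO, but a spike that pushes utilization to 85 percent before the autoscaler reacts breaches it:

```python
# A toy sketch of why a fixed HPA CPU target can miss the SLO: latency grows
# non-linearly with utilization, so the "safe" threshold depends on traffic,
# not just CPU. All numbers here are hypothetical.

def predicted_latency_ms(cpu_utilization: float, base_ms: float = 20.0) -> float:
    """Toy M/M/1-style queueing curve: latency blows up near saturation."""
    assert 0.0 <= cpu_utilization < 1.0
    return base_ms / (1.0 - cpu_utilization)

SLO_MS = 100.0
HPA_TARGET = 0.70  # a commonly chosen static CPU target

for util in (0.30, 0.50, HPA_TARGET, 0.85):
    latency = predicted_latency_ms(util)
    status = "OK" if latency <= SLO_MS else "SLO breach"
    print(f"cpu={util:.0%}  latency={latency:6.1f} ms  {status}")
```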

A typical Kubernetes application comprises several hundred microservices, and each microservice depends on others in a web of interconnected relationships. It is not feasible for a person to view and understand it all, make detailed changes, and then repeat the exercise for every release of every microservice, week after week. So SREs figuratively "turn up the compute knob" and hope it improves whatever has dropped below the service-level objective (SLO). But the reality is that adding resources to a microservice is useless when the microservice it depends on is the actual bottleneck.
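
The dependency problem is easy to demonstrate. In the sketch below (service names and utilization figures are invented), a latency alert fires on the frontend, but walking the service graph shows the saturation sits two hops downstream, so scaling the frontend would accomplish nothing:

```python
# A minimal sketch of why scaling the alerting service often does not help:
# the fix must target the most saturated service on the request path.
# The service graph and all utilization numbers are hypothetical.

# Each service lists its downstream dependencies.
DEPENDENCIES = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": [],
    "payments": [],
    "inventory": [],
}

UTILIZATION = {  # observed CPU utilization per service
    "frontend": 0.55,
    "checkout": 0.60,
    "catalog": 0.40,
    "payments": 0.95,  # the real bottleneck
    "inventory": 0.50,
}

def find_bottleneck(service: str) -> str:
    """Return the most saturated service reachable from `service`."""
    worst = service
    for dep in DEPENDENCIES[service]:
        candidate = find_bottleneck(dep)
        if UTILIZATION[candidate] > UTILIZATION[worst]:
            worst = candidate
    return worst

# The latency alert fires on "frontend", but the graph says otherwise:
print(find_bottleneck("frontend"))  # -> payments
```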

An Ideal Use Case for AI

In 2024, when someone says AI, the next thought is almost inevitably ChatGPT. ChatGPT is generative AI that selects the best next word. While the architecture required for a strong AIOps platform is very different from ChatGPT's (more on that later), the goal is similar: choose the best next state for the application.

The intricately interconnected ecosystems of modern microservice applications are too big and complex for the SRE team to comprehend in detail and make those decisions. Most efforts to autoscale these applications fail to take into account the nuanced requirements and performance needs of individual services. I’ve been hearing about this problem continuously for over 20 years (starting with the L5 network load balancer we invented at Arrowpoint Communications). 

The Digital Twin Goes Through Its Paces

Training data is the fuel for AI. To teach a model to operate a mission-critical Kubernetes instance, we need to develop good information about how its performance can be optimized. Digital twins have been used for decades in fields such as manufacturing and racing to recreate a digital equivalent of a real subject so that its behavior can be studied. In our case, we use performance metrics to build a digital twin of each microservice.
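
As a rough illustration of the idea (a deliberately crude stand-in, not our actual model; the quadratic fit and the metric samples are invented), a per-microservice twin can be as simple as a learned mapping from per-replica load to latency that you query with "what if" scenarios instead of experimenting on production:

```python
# A minimal sketch of a per-microservice digital twin: fit a simple
# response-time model from recorded performance metrics, then query it to
# predict behavior under configurations never tried in production.
import numpy as np

class MicroserviceTwin:
    """Predicts p99 latency from request rate per replica.

    Trained on (requests_per_sec, replicas, observed_p99_ms) samples
    scraped from the service's metrics.
    """

    def __init__(self, samples):
        # Least-squares quadratic fit of latency vs. per-replica load,
        # a crude stand-in for whatever learned model a real twin uses.
        load = np.array([rps / reps for rps, reps, _ in samples])
        p99 = np.array([p for _, _, p in samples])
        self.coeffs = np.polyfit(load, p99, deg=2)

    def predict_p99_ms(self, requests_per_sec: float, replicas: int) -> float:
        return float(np.polyval(self.coeffs, requests_per_sec / replicas))

# Hypothetical metrics: (req/s, replicas, observed p99 in ms).
twin = MicroserviceTwin([(100, 2, 35), (200, 2, 60), (300, 2, 110),
                         (200, 4, 30), (400, 4, 65), (600, 4, 120)])

# "What if" queries against the twin instead of against production:
print(twin.predict_p99_ms(500, 4))  # predicted latency at an untried load
print(twin.predict_p99_ms(500, 6))  # ...and with two more replicas
```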

In reinforcement learning (RL), digital twins create a simulation environment that provides the observation space in which a model can be trained. The model discovers and learns the best paths (known as "trajectories") for guiding the system to states with the desired properties in terms of cost, performance, and so on. In our case, we use proximal policy optimization (PPO) as the RL training algorithm. Our approach is service-graph aware, taking into account the microservice dependencies that affect scaling. Ultimately, we will have a model-free network that continually learns from operational experience.
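
For readers who want to see the shape of such a setup, here is a minimal sketch built on the open-source Gymnasium and Stable-Baselines3 libraries (an illustrative stand-in, not our platform: the twin model, action space, reward shaping, and all numbers are assumptions). The digital twin supplies the environment dynamics, and PPO learns a scaling policy against it without ever touching production:

```python
# A minimal sketch of twin-driven RL scaling. The twin model, action space,
# reward shaping, and all constants are hypothetical illustrations.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO  # PPO is the algorithm named above

class ScalingEnv(gym.Env):
    """Observation: (traffic, replicas); action: remove / hold / add a replica."""

    def __init__(self):
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2,))
        self.action_space = spaces.Discrete(3)  # 0: -1 replica, 1: hold, 2: +1
        self.replicas, self.traffic = 2, 100.0

    def _twin_latency_ms(self) -> float:
        # Stand-in for the digital twin's prediction (see the previous sketch).
        load = self.traffic / self.replicas
        return 20.0 + 0.002 * load ** 2

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.replicas, self.traffic = 2, 100.0
        return np.array([self.traffic, self.replicas], dtype=np.float32), {}

    def step(self, action):
        self.replicas = int(np.clip(self.replicas + (int(action) - 1), 1, 20))
        # Traffic drifts randomly; the policy must keep latency and cost in check.
        self.traffic = float(np.clip(self.traffic + self.np_random.normal(0, 20),
                                     10, 1000))
        # Reward trades off SLO compliance against resource cost.
        reward = -(self._twin_latency_ms() / 100.0) - 0.1 * self.replicas
        obs = np.array([self.traffic, self.replicas], dtype=np.float32)
        return obs, reward, False, False, {}

# Train entirely against the twin; trajectories never touch production.
model = PPO("MlpPolicy", ScalingEnv(), verbose=0)
model.learn(total_timesteps=10_000)
```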

Better Responsiveness and Ongoing Improvement

Kubernetes has come a long way. There is extensive tool-level automation, but not a lot of effective system-level automation. Perhaps that has a lot to do with the vast amount of activity within a Kubernetes instance. We boiled the problem down to deciding the best next state for the application. 

So far, the general public has been playing with generative AI that produces words and images. We are now seeing how the same technology can transform our digital experience.

For SREs Now and Developers of the Future

SREs today could benefit from a transformation. Talking to SRE teams, we have learned that they are asked to contribute to their own SLOs and often simply don't know where to begin. It seems that the complexity of Kubernetes has outpaced the ability of humans alone to operate it.

Looking ahead, applying AIOps models and moving toward autonomous infrastructure can allow for a new level of complexity and scale for microservices applications.

Published at DZone with permission of Raj Nair. See the original article here.

Opinions expressed by DZone contributors are their own.
