Chaos Engineering With Litmus: A CNCF Incubating Project

LitmusChaos helps identify weaknesses in system resilience by injecting faults like pod deletion, network latency, and resource exhaustion into applications.

By Sai Sandeep Ogety · Feb. 06, 25 · Tutorial

Problem statement: Ensuring the resilience of a microservices-based e-commerce platform.

System resilience is a key requirement for e-commerce platforms as they scale: services must stay available and perform well for users. Our platform is built on a microservices architecture and suffers sporadic failures during heavy-traffic events. Kubernetes pod crashes, resource exhaustion, and network disruptions during peak shopping seasons degrade service availability and cost revenue.

The organization plans to use Litmus, a CNCF incubating project, to assess and improve the platform's resilience. Litmus lets us simulate real-world failures such as pod termination, network delays, and resource limits, which exposes the system's weak points. These experiments let us validate autoscaling, test disaster recovery procedures, and tune Kubernetes settings for reliability.

The goal is a platform that withstands failures and absorbs traffic peaks without degrading the user experience. Applying chaos engineering proactively reduces risk, improves observability, and lets us build automated recovery mechanisms that keep the platform resilient across a wide range of operating conditions.

Set Up the Chaos Experiment Environment

Install LitmusChaos in your Kubernetes cluster:

Shell
 
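helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
# Install into the litmus namespace so the verification step below finds the pods
helm install litmus litmuschaos/litmus --namespace litmus --create-namespace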


Verify installation:

Shell
 
kubectl get pods -n litmus


Note: Ensure your cluster is ready for chaos experiments.
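
One way to check readiness is to block until the Litmus pods report Ready. The helper below is a minimal sketch: the function name wait_for_litmus, the default namespace, and the 300s timeout are assumptions for illustration, not part of Litmus.

```shell
# Block until every pod in the Litmus namespace reports Ready, or fail on timeout.
# Arguments: namespace (assumed default: litmus), timeout (assumed default: 300s).
wait_for_litmus() {
  local ns="${1:-litmus}"
  local timeout="${2:-300s}"
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}
```

Calling wait_for_litmus with no arguments waits on the litmus namespace; a non-zero exit status means at least one pod did not become Ready in time.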

Define the Chaos Experiment

Create a ChaosExperiment YAML file that simulates a pod-delete scenario, then apply it with kubectl apply -f pod-delete.yaml.

Example (pod-delete.yaml):

YAML
 
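apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  # The experiment must live in the same namespace as the ChaosEngine that runs it
  namespace: <target-namespace>
spec:
  definition:
    scope: Namespaced
    # Wildcard permissions keep the example short; restrict them for real use
    permissions:
      - apiGroups: ["*"]
        resources: ["*"]
        verbs: ["*"]
    image: "litmuschaos/go-runner:latest"
    # The go-runner image selects the experiment by name
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete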


Install ChaosOperator and Configure Service Account

Deploy ChaosOperator to manage experiments:

Shell
 
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/litmus-operator/cluster-k8s.yml


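Note: Create a ServiceAccount in the target namespace to grant the experiment the permissions it needs. The ChaosEngine in the next section references it as litmus-admin.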
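
The ServiceAccount could be created with standard kubectl commands, roughly as sketched here. The function name create_chaos_rbac is hypothetical, and the verb and resource lists are an assumption about what a pod-delete run needs; compare them with the RBAC manifest published for the experiment before relying on them. The Litmus CRDs must already be installed for the role to be created.

```shell
# Create a namespaced ServiceAccount, Role, and RoleBinding for chaos runs.
# Arguments: target namespace, account name (default matches the ChaosEngine: litmus-admin).
# The verbs and resources below are an assumed minimum, not an official list.
create_chaos_rbac() {
  local ns="$1"
  local sa="${2:-litmus-admin}"
  kubectl create serviceaccount "$sa" -n "$ns" || return 1
  kubectl create role "$sa" -n "$ns" \
    --verb=create,delete,get,list,patch,update,watch \
    --resource=pods,pods/log,events,jobs.batch,deployments.apps,chaosengines.litmuschaos.io,chaosexperiments.litmuschaos.io,chaosresults.litmuschaos.io || return 1
  # Bind the Role to the ServiceAccount inside the same namespace.
  kubectl create rolebinding "$sa" -n "$ns" --role="$sa" --serviceaccount="$ns:$sa"
}
```

A namespaced Role is used instead of a ClusterRole so that the experiment can only touch resources in the namespace under test.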

Inject Chaos into the Target Application

Label the application namespace for chaos:

Shell
 
kubectl label namespace <target-namespace> litmuschaos.io/chaos=enabled


Deploy a ChaosEngine to trigger the experiment:

Example (chaosengine.yaml):

YAML
 
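apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: <target-namespace>
spec:
  # Identifies the workload whose pods will be deleted
  appinfo:
    appns: '<target-namespace>'
    applabel: 'app=<your-app-label>'
    appkind: 'deployment'
  # Must match a ServiceAccount that exists in the target namespace
  chaosServiceAccount: litmus-admin
  monitoring: false
  # Name of the ChaosExperiment to run
  experiments:
    - name: pod-delete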


Apply the ChaosEngine:

Shell
 
kubectl apply -f chaosengine.yaml
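
If the experiment causes more disruption than expected, it helps to have an abort path ready. The sketch below patches the ChaosEngine's engineState field to stop; the function name abort_chaos is hypothetical.

```shell
# Abort a running experiment by switching the ChaosEngine to the "stop" state.
# Arguments: ChaosEngine name, namespace.
abort_chaos() {
  local engine="$1"
  local ns="$2"
  kubectl patch chaosengine "$engine" -n "$ns" --type merge \
    --patch '{"spec":{"engineState":"stop"}}'
}
```

For the engine defined above, the call would be abort_chaos pod-delete-engine followed by the target namespace.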


Monitor the Experiment

View the progress:

Shell
 
kubectl describe chaosengine pod-delete-engine -n <target-namespace>


Check the status of the chaos pods:

Shell
 
kubectl get pods -n <target-namespace>
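
Beyond pod status, Litmus records the outcome in a ChaosResult resource named after the engine and the experiment. The helper below is a sketch that prints the verdict; the function name chaos_verdict is hypothetical, and the field path assumes a recent Litmus release (older releases may spell it differently).

```shell
# Print the verdict stored in the ChaosResult for an engine/experiment pair.
# Litmus names the ChaosResult <engine-name>-<experiment-name>.
# Arguments: ChaosEngine name, experiment name, namespace.
chaos_verdict() {
  local engine="$1"
  local experiment="$2"
  local ns="$3"
  # Field path is an assumption based on recent Litmus releases.
  kubectl get chaosresult "${engine}-${experiment}" -n "$ns" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}
```

With the names used in this article, the arguments would be pod-delete-engine, pod-delete, and the target namespace.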


Analyze the Results

Post-experiment, review logs and metrics to determine if the application recovered automatically or failed under stress.

Here are some metrics to monitor:

  • Application response time
  • Error rates during and after the experiment
  • Time taken for pods to recover
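
Recovery time can be approximated from the command line by timing how long the Deployment takes to report all replicas available again. This is a rough sketch, assuming it is started right after the chaos begins; the function name time_recovery and the default timeout are illustrative.

```shell
# Print the number of seconds a Deployment needs to return to full availability.
# Arguments: deployment name, namespace, timeout (assumed default: 300s).
time_recovery() {
  local deploy="$1"
  local ns="$2"
  local timeout="${3:-300s}"
  local start end
  start="$(date +%s)"
  # rollout status blocks until the desired replicas are available or the timeout expires.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout" >/dev/null || return 1
  end="$(date +%s)"
  echo "$((end - start))"
}
```

The result has one-second resolution, which is enough to compare runs before and after a fix but not a substitute for metrics from a monitoring stack.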

Solution

Root cause identified: During high traffic, pods failed due to an insufficient number of replicas in the deployment and improper resource limits.

Fixes applied:

  • Increased the number of replicas in the deployment to handle higher traffic
  • Configured proper resource requests and limits for CPU and memory in the pod specification
  • Implemented a Horizontal Pod Autoscaler (HPA) to handle traffic spikes dynamically
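
The three fixes above map onto standard kubectl commands. The sketch below shows one possible sequence; the function name apply_resilience_fixes is hypothetical, and the replica counts, CPU and memory values, and 70% CPU target are placeholder assumptions, not the values used on the platform.

```shell
# Apply the three fixes to a Deployment: more replicas, resource settings, and an HPA.
# Arguments: deployment name, namespace, min replicas (default 3), max replicas (default 10).
# All numeric values are illustrative placeholders.
apply_resilience_fixes() {
  local deploy="$1"
  local ns="$2"
  local min="${3:-3}"
  local max="${4:-10}"
  # Raise the baseline replica count.
  kubectl scale "deployment/$deploy" -n "$ns" --replicas="$min" || return 1
  # Set CPU and memory requests and limits on the containers.
  kubectl set resources "deployment/$deploy" -n "$ns" \
    --requests=cpu=250m,memory=256Mi \
    --limits=cpu=500m,memory=512Mi || return 1
  # Create a Horizontal Pod Autoscaler that targets average CPU utilization.
  kubectl autoscale "deployment/$deploy" -n "$ns" \
    --min="$min" --max="$max" --cpu-percent=70
}
```

The resource requests come before the autoscaler because CPU-based autoscaling measures utilization relative to the requested CPU, and it also needs a metrics source such as metrics-server in the cluster.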

Conclusion

By using LitmusChaos to simulate pod failures, we identified key weaknesses in the e-commerce platform’s Kubernetes deployment. The chaos experiment demonstrated that resilience can be significantly improved with scaling and resource allocation adjustments. Chaos engineering enabled proactive system hardening, leading to better uptime and customer satisfaction.
