DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Key Considerations When Implementing Virtual Kubernetes Clusters
  • Top 10 Open Source Projects for SREs and DevOps
  • AI Agents for DevOps on Kubernetes Need Real Engineering, Not Magic
  • How I Fixed a Silent Production Bug in Apache Airflow That Affected Thousands of Deployments

Trending

  • 7 Technology Waves I’ve Seen in 30 Years of Software — Will AI Be the Next Real Transformation?
  • Building AI-Powered Java Applications With Jakarta EE and LangChain4j
  • DZone's Article Submission Guidelines
  • Persistent Memory for AI Agents Using LangChain's Deep Agents
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. DevOps and CI/CD
  4. How I Cut Kubernetes Debugging Time by 80% With One Bash Script

How I Cut Kubernetes Debugging Time by 80% With One Bash Script

Built a bash script that analyzes Kubernetes clusters in 60 seconds, generating HTML/JSON/Markdown reports. Saved 70 minutes daily across 8 clusters.

By 
Shamsher Khan user avatar
Shamsher Khan
DZone Core CORE ·
Nov. 20, 25 · Analysis
Likes (4)
Comment
Save
Tweet
Share
5.3K Views

Join the DZone community and get the full member experience.

Join For Free

Here's the truth about Kubernetes troubleshooting: 80% of your time goes into finding WHAT broke and WHERE it broke. Only 20% goes into actually fixing it. For months, I lived this reality, managing eight Kubernetes clusters. Every issue followed the same pattern: 30 minutes of kubectl detective work, five minutes to fix the actual problem. I was spending hours hunting for needles in haystacks. Then one weekend, I flipped that ratio.

Every Monday at 8 AM, our team's Teams chat explodes. "Hey, the dashboard is down." "Perf team can't access their pods." "Build agents crashed overnight."

Here's how I built a tool to cut through the chaos in under 60 seconds.

My Reality

Our team manages 8 Kubernetes clusters (AKS) - 4 in production and 4 in non-production (dev, e2e, perf, prodfix). When something breaks, our DevOps team gets pinged first.

The Problem: Finding WHAT broke and WHERE it broke takes 80% of the time. Fixing it takes 20%.

The Solution: A bash script that generates a complete cluster health report in under a minute.

Note: For this article, I demonstrate with a local minikube cluster, but the tool works identically on production clusters of any size.

A Typical Monday Morning

7:55 AM. Coffee in hand. I open Teams.

47 unread messages.

Plain Text
 
"Hey, dashboard is not up on perf cluster" 
"Build agent failed overnight - need this fixed ASAP"  
"E2E environment - services not coming up after weekend shutdown" 
"Scheduled job didn't run - team waiting for data"


Every. Single. Monday.

Why? Cost-saving measures. We shut down non-production clusters over the weekend. Monday morning, we bring them back up.

That's when the chaos starts.

The Real Problems I Face Daily

Problem 1: The Dashboard Mystery

The message: "Dashboard is not up on perf cluster. Team can't check pod health after testing."

My old process:

Shell
 
kubectl get namespaces | grep dashboard  
kubectl get pods -A | grep dashboard      
kubectl describe pod dashboard-xyz        
kubectl get pdb -A | grep dashboard       
kubectl get pods -n istio-system  


15 minutes later: "It's the Pod Disruption Budget. Istio only has one replica running."

The team: "Can we just get the dashboard back up?"

Problem 2: The Build Agent Nightmare

The message: "Build agents crashed overnight. Batch job failed. Team doesn't have their data."

What actually happened:

  • StatefulSet pod crashed during build
  • Cached issue? PAT token expired? Out of resources?
  • Need to check: Logs, events, node status, storage

My old process:

Shell
 
kubectl get statefulsets -A              
kubectl get pods -n build-agents         
kubectl describe pod agent-xyz           
kubectl logs agent-xyz --previous        
kubectl get nodes                        
kubectl get pvc -n build-agents        


20 minutes later: Found it. Cached image issue + node was under memory pressure.

The team: Still waiting.

Problem 3: The Secret Token Surprise

The message: "Scheduled job failed. Data didn't transfer from blob storage. Team waiting for latest data."

The actual problem: Token expired. Job couldn't authenticate. The auto-deploy pipeline didn't get the latest data.

My old process:

Shell
 
kubectl get cronjobs -A                  
kubectl get jobs -A --sort-by=.status.startTime  
kubectl describe job job-xyz             
kubectl logs job-xyz-pod   


Buried in logs: Authorization failed: Token expired

30 minutes later: Fixed the secret. Re-ran the job manually.

The team: Already missed their morning data window.

Problem 4: The PVC Mount Failure

The message: "Pod keeps restarting in production."

The actual problem: PVC mount failed. Secret expired. Pod in CrashLoopBackOff.

My old process:

Shell
 
kubectl get pods -n production           
kubectl describe pod failing-pod         
kubectl get pvc -n production            
kubectl describe pvc pvc-xyz             
kubectl get secrets -n production 


10 minutes later: Secret expired 3 days ago. Nobody noticed until it tried to restart.

The Pattern I Noticed

Every morning, the same pattern:

  1. Someone reports a problem (Teams message)
  2. I start investigating (10-30 minutes of kubectl commands)
  3. I explain what I found (Copy-paste terminal output to Teams)
  4. I fix the issue (5 minutes)
  5. I update everyone (More copy-paste)

Total time: 30-45 minutes per issue.

The frustrating part? 80% of the time was spent finding the problem, not fixing it.

What I Actually Needed

Not another monitoring tool. We have Prometheus, Grafana, and Azure Monitor.

What I needed was a snapshot:

  1. What's healthy? (Green = good)
  2. What's unhealthy? (Red = look here)
  3. What's weird? (Yellow = investigate later)
  4. Where do I focus first? (Priority issues)

And I needed it fast. Under 1 minute.

Most importantly, I needed to share this with teams that don't speak kubectl.

Kubernetes cluster


The Tool I Built

Over a weekend, I built k8s-cluster-discovery.sh.

What it does:

  • Collects data from kubectl (nodes, pods, workloads, Istio, storage)
  • Analyzes health status automatically
  • Generates reports in three formats: HTML, JSON, or Markdown
  • Takes less than 60 seconds

Works with ANY Kubernetes cluster:

  • Azure Kubernetes Service (AKS)
  • Amazon Elastic Kubernetes Service (EKS)
  • Google Kubernetes Engine (GKE)
  • Red Hat OpenShift
  • Rancher
  • Local clusters (minikube, kind, k3s)
  • On-premises Kubernetes

Platform support:

  • Linux (x86_64, ARM64)
  • macOS (Intel and Apple Silicon M1/M2/M3)
  • Windows (WSL2)

Usage:

Shell
 
# Quick health check - entire cluster 
bash k8s-cluster-discovery.sh my-cluster html 

# Focused investigation - specific namespace  
bash k8s-cluster-discovery.sh my-cluster html production 

# For automation - JSON output 
bash k8s-cluster-discovery.sh my-cluster json 

# For documentation - Markdown bash 
k8s-cluster-discovery.sh my-cluster markdown


One Tool, Three Formats

Here's what makes this tool flexible: three output formats for different needs.

Format 1: HTML (For Humans)

Shell
 
bash k8s-cluster-discovery.sh my-cluster html


Who uses it: DevOps team, development teams, management, and new team members

Why: Color-coded health indicators, interactive, shareable (email, Teams), works on mobile

Use case: "What's the cluster status?" → Send HTML file

Color-coded summary cards (Total Pods, Deployments, Istio Sidecars, Workloads)

Screenshot showing: Color-coded summary cards (Total Pods, Deployments, Istio Sidecars, Workloads)


Format 2: JSON (For Automation)

Shell
 
bash k8s-cluster-discovery.sh my-cluster json


Who uses it: CI/CD pipelines, automation scripts, custom dashboards

Use case: Pre-deployment validation

Shell
 
bash k8s-cluster-discovery.sh prod-cluster json

PROBLEMS=$(jq '[.pod_health[].problems[]] | length' report.json)

if [ "$PROBLEMS" -gt 5 ]; then
  echo "Cluster unhealthy - deployment blocked"
  exit 1
fi

echo "Cluster healthy - proceeding with deployment"


Real example: Our pipeline caught 8 failing pods before we deployed to production last week.

Format 3: Markdown (For Documentation)

Shell
 
bash k8s-cluster-discovery.sh my-cluster markdown


Who uses it: Platform team, architecture reviews, Git repositories

Use case: Weekly architecture reviews

Shell
 
bash k8s-cluster-discovery.sh prod-cluster markdown > docs/cluster-state-$(date +%Y-%m-%d).md
git add docs/
git commit -m "Weekly cluster snapshot"
git diff HEAD~1 docs/cluster-state-2025-10-31.md


Demo: My Local Cluster

For this article, I'm demonstrating with a local minikube cluster. My production environment runs larger clusters, but the tool works the same way regardless of cluster size.

Running on my demo cluster:

Shell
 
bash k8s-cluster-discovery.sh opscart html


Time taken: ~30 seconds (smaller cluster = faster).

What I got:

  • Complete health overview
  • All pods with status indicators
  • Workload analysis (Deployments, StatefulSets, CronJobs)
  • Istio service mesh configuration
  • Storage usage

All-in-one visual dashboard — whether you have 30 pods or 300 pods, the format is the same.

Full HTML report with multiple sections (Workloads, Deployments, Istio)

Screenshot showing: Full HTML report with multiple sections (Workloads, Deployments, Istio)


How This Changed My Mornings

Old Monday Morning (Before Tool)

8:00 AM: Teams messages start
8:05 AM: Start investigating perf cluster
8:15 AM: Copy kubectl output to Teams, explain what it means
8:20 AM: Fix dashboard issue
8:25 AM: New message - build agent problem
8:30 AM: Start investigating build agent
8:50 AM: Found the issue
9:00 AM: Standup meeting - still explaining cluster status
9:30 AM: Finally done with morning fires

Total: 90 minutes of firefighting

New Monday Morning (With Tool)

7:55 AM: Run health reports:

Shell
 
# Check production clusters 
opscart@mac scripts % ./k8s-cluster-discovery.sh prod-1 html             
╔════════════════════════════════════════════════════════════╗
║  Full Cluster Report
║  Cluster: prod-1
║  Output: html
╚════════════════════════════════════════════════════════════╝

Collecting cluster information...
  ✓ Version: v1.28.0
Collecting node information...
  ✓ Nodes: 1/1 ready
Collecting namespace information...
  ✓ Namespaces: 12 (3 with Istio injection)
  Checking Istio...
  ✓ Istio: 1.19.0 (Gateways: 1, VS: 2)
Collecting workloads...
  ✓ Deployments: 13 | StatefulSets: 2 | CronJobs: 3 | DaemonSets: 2
Analyzing ALL pods...
  ✓ Pods: 37 total (15 with sidecars, 1 with issues)
Collecting HPAs & PDBs...
  ✓ HPAs: 4 | PDBs: 5
Collecting storage...
  ✓ PVCs: 2

Generating report...

HTML report: ./cluster-reports/prod-01_20251103_142955.html


Shell
 
# Check non-production clusters where issues are reported 
bash k8s-cluster-discovery.sh perf html 
bash k8s-cluster-discovery.sh e2e html 
bash k8s-cluster-discovery.sh dev html


8:00 AM: Open Teams
8:01 AM: Upload HTML reports to relevant threads
8:02 AM: Dashboard issue: PDB blocking scale-down. Istio at 1 replica. Fixing now.
8:05 AM: Dashboard back up
8:06 AM: Build agent: Node memory pressure + cached image issue. Clearing cache.
8:10 AM: Build agent healthy
8:15 AM: All morning issues resolved
9:00 AM: Standup - All clusters healthy. Reports in Teams if you want details.

Total: 20 minutes of firefighting

Time saved: 70 minutes. Every single day.

The Features That Actually Matter

Note: The screenshots below are from a demo cluster running on minikube. My production clusters have higher pod counts, but the interface and functionality are identical regardless of scale.

1. Visual Health Dashboard

Open the HTML report. First thing you see:

Plain Text
 
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  
│ Total Pods   │  │ Deployments  │  │ Istio        │  
│      35      │  │      13      │  │ Sidecars     │  
│ 1 with issues│  │  12 healthy  │  │  20 (57%)    │  
└──────────────┘  └──────────────┘  └──────────────┘  
    Red              Yellow            Yellow


Instant visibility: Red = problems. Yellow = warnings. Green = all good.

2. Smart Warnings

The script doesn't just collect data. It analyzes it.

Example warning:

Plain Text
 
Configuration Warning:

Namespace 'kubernetes-dashboard' has node selector set to 
'kubernetes.azure.com/mode=system'. This schedules pods on 
system node pools, which is unusual for user applications.


This caught misconfigurations in 3 namespaces we didn't know about.

3. Complete Pod Table

Every pod. Every namespace. With details:

Pod Name Status Ready Containers Sidecar Restarts
dashboard-abc Running ✓ 2/2 ✓ 0
agent-xyz Failed ✗ 0/1 ✗ 47
job-123 Pending ✗ 0/1 ✗ 0


One glance shows exactly which pods need attention.

Pod table with status indicators and details

Screenshot showing: Pod table with status indicators and details


4. Istio Mesh Visibility

  • Which pods have sidecars injected?
  • Which pods are missing sidecars?
  • Gateway configurations
  • VirtualService status

5. Cluster Cleanup Insights

Bonus benefit: The report shows unused PVCs, jobs that haven't run in days, CronJobs with high failure rates, and deployments with zero replicas.

Perfect for monthly cleanup and cost optimization.

Real Impact on My Team

For Our DevOps Team

Before: "Let me check... give me 15 minutes..."
After: "See report. Dashboard issue = PDB. Fixing now."

Before: 2 hours/day on cluster health checks
After: 20 minutes/day

Team adoption: All team members now use the tool for their daily health checks.

For Development Teams

Before: "What's wrong with perf cluster?"
Me: [Sends kubectl output]
Them: "What does this mean?"

After: "What's wrong with perf cluster?"
Me: [Sends HTML report]
Them: "Oh, 3 red pods. Got it."

For Management

Before: "How are the clusters?"
Me: [Tries to explain pod states]
Them: [Confused faces]

After: "How are the clusters?"
Me: [Sends HTML report]
Them: "Mostly green. Looks good."

They actually understand cluster health now.


Why This Works for Multi-Cluster Management

Managing 8 clusters (4 production, 4 non-production) across our team means 8 different configurations to track.

New way:

Shell
 
# Production clusters
for cluster in prod-1 prod-2 prod-3 prod-4; do
  bash k8s-cluster-discovery.sh $cluster html
done

# Non-production clusters
for cluster in dev e2e perf prodfix; do
  bash k8s-cluster-discovery.sh $cluster html
done


5 minutes to understand the health of all 8 clusters.

Technical Details

Data Collection

Uses kubectl to collect:

  • All nodes with status, taints, labels
  • All namespaces with Istio injection status
  • All workloads (Deployments, StatefulSets, CronJobs, DaemonSets)
  • Every pod with container counts, restart counts, sidecar status
  • Istio resources
  • HPAs and PDBs
  • PVCs

Health Analysis

Automatically detects:

  • Failed or pending pods
  • Pods with high restart counts
  • Unusual configurations
  • Unhealthy workloads
  • Missing Istio sidecars

Report Generation

Generates a self-contained HTML file with:

  • Embedded CSS (color-coded design)
  • Minimal JavaScript (collapsible sections)
  • No external dependencies
  • Works offline
  • Can be emailed or uploaded to SharePoint

File size: 150-500 KB depending on cluster size

Performance

Demo cluster (opscart - ~40 pods):

  • Data collection: ~20 seconds
  • Report generation: ~10 seconds
  • Total: ~30 seconds

Production cluster (~250 pods):

  • Data collection: ~45 seconds
  • Report generation: ~12 seconds
  • Total: ~60 seconds

The tool scales linearly with cluster size.

Real Use Cases Beyond Health Checks

1. Pre-Deployment Validation

Shell
 
bash k8s-cluster-discovery.sh prod-cluster json
PROBLEMS=$(jq '[.pod_health[].problems[]] | length' report.json)

if [ "$PROBLEMS" -gt 5 ]; then
  echo "Cluster unhealthy - deployment blocked"
  exit 1
fii


Saved us from deploying to an unhealthy cluster twice last month.

2. Cost Optimization

Run monthly reports. Look for:

  • Unused PVCs
  • Deployments with 0 replicas
  • CronJobs that haven't succeeded in 30+ days

Last month: Found 15 unused PVCs. Deleted them. Saved $200/month.

3. Onboarding New Engineers

Before: "Here's kubectl... these 20 commands..."
After: "Here's this morning's report. Green = healthy, red = issues."

New engineers understand cluster health in 5 minutes instead of 5 days.

4. Incident Documentation

During incidents, generate report:

Shell
 
bash k8s-cluster-discovery.sh prod-cluster html


Save the HTML file. Archive it.

Post-mortem meetings: "Here's the cluster state at the time of incident."

What I Learned

1. Simple Tools Win

I could have built a React dashboard with a back-end API.

I didn't need to.

A bash script and three output formats solved the problem perfectly.

2. Share Everything

The game-changer wasn't collecting the data.

It was making it shareable.

HTML files can be:

  • Opened by anyone
  • Shared in Teams
  • Emailed to management
  • Archived for compliance
  • Viewed on mobile

3. Visibility Reduces Chaos

Before: Everyone asks me for status
After: Everyone looks at reports

Fewer interruptions = more time to fix actual problems.

Why Not Use Other Tools?

kubectl

Good for: Powerful, flexible command-line tool for Kubernetes management
Not good for: Terminal output isn't shareable with non-DevOps teams, requires deep Kubernetes knowledge
Bottom line: Great for operations, but you can't send kubectl output to your manager or development team.

k9s

Good for: Interactive, real-time cluster monitoring with fast keyboard navigation
Not good for: No way to generate shareable reports or export snapshots for later review
Bottom line: Excellent for live debugging, but useless when you need to share cluster status with others.

Lens

Good for: Full-featured Kubernetes IDE with GUI, multi-cluster support, and built-in tools
Not good for: Requires installation, heavy on resources, not designed for quick health snapshots
Bottom line: Perfect as your daily K8s environment, but overkill when you just need a fast cluster overview.

Prometheus + Grafana

Good for: Real-time metrics, historical data, powerful alerting, and custom dashboards
Not good for: Complex setup, requires dedicated infrastructure, not built for quick one-time snapshots
Bottom line: Essential for production monitoring, but too heavy when you just want to answer "Is the cluster healthy right now?"

Kubevious

Good for: Visual cluster topology, rule-based validation, and change tracking
Not good for: Requires installation and setup, web UI only, heavier footprint than a simple script
Bottom line: Great for deep analysis and governance, but too much when you need a quick morning health check.

KubeEye

Good for: Security scanning, best practice checks, and compliance validation
Not good for: Command-line only, focused on finding problems rather than overall health overview
Bottom line: Excellent for audits, but doesn't give you the big picture of cluster health at a glance.

What I needed: Simple health snapshots + shareability + automation.

This tool fills the gap between "quick kubectl commands" and "full monitoring infrastructure."

Open Source and Available Now

The tool is open source on GitHub.

Quick Start

Shell
 
git clone https://github.com/opscart/kubectl-health-snapshot.git
cd kubectl-health-snapshot/scripts
chmod +x k8s-cluster-discovery.sh


Requirements

  • kubectl configured for your cluster (any Kubernetes distribution)
  • jq JSON processor
  • Bash 4.0+

Works On

  • Linux (Ubuntu, RHEL, Debian, etc.)
  • macOS (Intel and Apple Silicon M1/M2/M3)
  • Windows (via WSL2)

Tested With

  • Azure AKS, Amazon EKS, Google GKE
  • Minikube, Kind, K3s, Docker Desktop

No complex installation. No servers. No configuration.

What's Next

Near term:

  • Multi-cluster comparison view
  • Scheduled pipeline integration
  • Cost analysis per namespace

Community requests:

  • YAML output for GitOps
  • Historical trending
  • Teams webhook notifications

Want to contribute? Issues and PRs welcome on GitHub.

The Bottom Line

Every morning used to start with chaos:

  • Teams messages
  • kubectl detective work
  • Copy-paste output
  • Repeat for 8 clusters

Now:

Shell
 
bash k8s-cluster-discovery.sh my-cluster html


60 seconds later: Complete health report. Visual. Shareable. Actionable.

Key Takeaways

  • 80% of Kubernetes debugging is detection, not fixing. Most of the time is wasted finding problems, not solving them.
  • A simple bash script beats complex tools. kubectl + jq + 60 seconds replaces hours of commands and multiple monitoring tools.
  • Share-ability bridges the DevOps gap. HTML reports let non-technical teams understand cluster health without kubectl expertise.
  • Zero setup, maximum impact. No installation, no infrastructure, no configuration — just run it.
  • Open source and ready to use. Try it in 5 minutes: .

Try It Yourself

GitHub: github.com/opscart/kubectl-health-snapshot
Author: opscart - DevOps Tools & Automation
License: MIT

If you manage Kubernetes clusters, what does your Monday morning look like? How do you handle the firefighting? Share your thoughts in the comments.

Try the tool today: https://github.com/opscart/kubectl-health-snapshot. Star the repo if it saves you time! 

Kubernetes DevOps Open source

Opinions expressed by DZone contributors are their own.

Related

  • Key Considerations When Implementing Virtual Kubernetes Clusters
  • Top 10 Open Source Projects for SREs and DevOps
  • AI Agents for DevOps on Kubernetes Need Real Engineering, Not Magic
  • How I Fixed a Silent Production Bug in Apache Airflow That Affected Thousands of Deployments

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook