DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Real-Time Communication Protocols: A Developer's Guide With JavaScript
  • Janus vs. MediaSoup: The Ultimate Guide To Choosing Your WebRTC Server
  • The Battle Of New Industry Standards: Is WebRTC Making Zoom Redundant?

Trending

  • The Hidden Cost of Overprivileged Tokens: Designing Messaging Platforms That Assume Compromise
  • Dear Micromanager: Your Distrust Has a Job; It’s Just Not the One You’re Doing
  • From AI Chaos to Control: Building Enterprise-Grade LLM Gateways With MuleSoft Anypoint
  • Key Takeaways From Integrating a RAG Application With LangSmith
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Deployment
  4. WebRTC at Scale: Docker, GPU Nodes, Prometheus, and Latency-Based Autoscaling on GKE

WebRTC at Scale: Docker, GPU Nodes, Prometheus, and Latency-Based Autoscaling on GKE

WebRTC needs more than CPU scaling. Discover how GKE, Docker, GPUs, and Prometheus enable latency-driven autoscaling for faster, smoother, real-time applications.

By 
Pragya Keshap user avatar
Pragya Keshap
·
Nov. 11, 25 · Tutorial
Likes (3)
Comment
Save
Tweet
Share
1.6K Views

Join the DZone community and get the full member experience.

Join For Free

Real-time apps are now part of daily life. We use them for video calls, live classes, online games, and health checkups. These apps need to respond fast. Even a small delay in sound or video makes them hard to use.

WebRTC is the open standard that powers most of these apps. It runs in browsers and on mobile devices, allowing direct audio, video, and data connections. But scaling WebRTC apps in the cloud is tricky.

Why? Because these apps are heavy on compute. They use video encoding, decoding, and sometimes extra features like background blur or translation. CPUs alone are not always enough to handle the load. And when usage grows quickly, users will notice lag before the system begins to show signs of trouble.

In this article, I will show how I built a Dockerized WebRTC application on Google Kubernetes Engine (GKE) with GPU nodes, Prometheus and Grafana monitoring, and autoscaling based on latency. I will also share lessons I learned and scripts I wrote to save costs.

Why CPU Metrics Are Not Enough

Most Kubernetes setups use the Horizontal Pod Autoscaler (HPA) with CPU utilization as the metric. The HPA checks CPU or memory use and adds pods when the usage passes a limit. This is fine for many apps. But it does not work well for WebRTC.

When I tested my demo app, CPU use stayed at normal levels. The HPA did not add more pods until usage hit 70%. By that time, video quality was already poor. The delay reached 300 - 400ms. Frames dropped. Audio broke up.

Users don’t care about CPU use. They care about smooth calls. This gap between system metrics and user experience is the main problem.

The solution is to monitor latency, not CPU usage. Latency shows how long it takes the app to respond. A spike in latency tells us users are suffering, even if CPU is fine.

The Setup

Here’s the system I built:

  • Cluster: A GKE cluster with two pools. One pool uses regular CPU machines. The second pool uses NVIDIA T4 GPUs.
  • NVIDIA device plugin: This plugin makes GPUs visible to Kubernetes. Pods can then ask for GPU resources.
  • WebRTC app: A Dockerizedcontainer service that asks for one GPU. It exposes two endpoints: 
    • /healthz to show if it is running
    • /metrics to share latency data
  • Prometheus and Grafana: These tools collect metrics and show them on dashboards. They let me see latency, GPU use, requests per second, and pod count.

  • Autoscaling: I kept a CPU-based HPA for basic scaling. But I also added KEDA (Kubernetes Event Drive Autoscaling), which scales pods based on Prometheus queries. This allowed me to scale on latency.

I also added a simple HTML dashboard built in JavaScript to simulate requests to the /process endpoint. It shows live p95 latency, RPS, and bar graphs for each mode: CPU, GPU, and autoscaling. This made it easy to demonstrate how latency changed with each mode — CPU vs. GPU.

This system ties scaling directly to what the user feels.

Diagram of and end-to-end setup for scaling WebRTC on GKE.

Figure 1: End-to-end setup for scaling WebRTC on GKE. A GPU-backed media service runs in a Kubernetes cluster. Prometheus, Grafana, and KEDA handle monitoring and autoscaling.


Startup and Teardown

Cloud resources cost money. GPU nodes are expensive, even when idle. I wanted a way to spin things up fast for testing and then shut them down when done.

I wrote two scripts:

startup.sh

This script sets up everything I need:

  • Creates a cluster with both CPU and GPU node pools
  • Installs the NVIDIA device plugin
  • Installs Prometheus, Grafana, and the DCGM exporter for GPU metrics
  • Deploys the WebRTC app in the rtc namespace

It only creates the GPU pool if I pass CREATE_GPU=1. This lets me test with CPU only when I don’t need GPUs. 

teardown.sh

This script cleans up:

  • Deletes the WebRTC app and namespace
  • Removes Prometheus and Grafana
  • Deletes GPU pools and the cluster
  • Checks for leftover load balancers, IPs, and disks

This step is critical. These scripts reduce setup time and protect against unexpected bills. 

Also, since the app runs in Docker, I could test the same image locally before pushing it to Artifact Registry and deploying it on GKE.

Interactive Demo and Visualization

To make the scaling results more tangible, I built a lightweight HTML dashboard.
It sends concurrent POST requests to the /process endpoint, measures round-trip time, and updates a live chart showing average and p95 latency.

The dashboard also includes:

  • A GPU/CPU mode indicator, showing where requests are being handled.
  • Concurrency and iteration controls to adjust live load.
  • A colored latency bar for quick visual feedback.

This helped simulate real-time stress tests in the browser and visualize how GPU acceleration and autoscaling improved performance. Below is the picture of the same.


Image of a lightweight HTML dashboard.

Deploying the WebRTC App

A simplified deployment looks like this:

YAML
 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-media
  namespace: rtc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-media
  template:
    metadata:
      labels:
        app: gpu-media
    spec:
      containers:
      - name: gpu-media
        image: us-central1-docker.pkg.dev/webrtcscaling/containers/gpu-media:v1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "2"
            memory: "4Gi"
            nvidia.com/gpu: 1

Image of Kubernetes objects for the WebRTC GPU service.

Figure 2: Kubernetes objects for the WebRTC GPU service. Prometheus scrapes metrics, Grafana displays dashboards, and the NVIDIA plugin makes GPU access possible.

Once the pod is running, I checked it with:

Shell
 
kubectl -n rtc port-forward svc/gpu-media 8080:80
curl -s http://localhost:8080/healthz


The result looked like this:

JSON
 
{"ok": true, "device": "cuda"}


This shows the pod is on a GPU node.

Monitoring Latency

CPU graphs are not enough. I needed to see what users would see. That means latency.

I used Prometheus to scrape the app’s /metrics endpoint. I also used the NVIDIA DCGM exporter to track GPU use. Grafana dashboards made it easy to view everything together.

Image shows how media flows through the GPU pod and how Prometheus scrapes latency metrics for monitoring.

Figure 3: How media flows through the GPU pod and how Prometheus scrapes latency metrics for monitoring.

My dashboard had four key panels:

  • p95 latency — 95% of requests were faster than this value.
  • Requests per second — showed the traffic level.
  • GPU use — checked how hard the GPU was working.
  • Pod replicas — confirmed scaling actions.

The main PromQL query for p95 latency:

PLSQL
 
histogram_quantile(0.95, sum(rate(app_request_latency_seconds_bucket[2m])) by (le))


With this, you can see latency spikes on Grafana panels. You can also watch GPU use and pod counts.

 I could see latency spikes before users complained.

Autoscaling on Latency

The real power came from autoscaling based on latency.

First, here’s a standard CPU-based HPA:

YAML
 
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60


This works, but only reacts when CPU usage is high. By that time, user calls may already be choppy.

With KEDA, I set a scaling rule on latency:

YAML
 
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-media-latency
  namespace: rtc
spec:
  scaleTargetRef:
    name: gpu-media
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
      metricName: app_latency_p95_seconds
      query: |
        histogram_quantile(0.95, sum(rate(app_request_latency_seconds_bucket[2m])) by (le))
      threshold: "0.20"


Image of KEDA using a PromQL query on p95 latency to decide when to scale replicas up or down.

Figure 4: KEDA uses a PromQL query on p95 latency to decide when to scale replicas up or down


This means that if the p95 latency is above 200ms, scale out.

In tests, latency jumped past 200ms under load. KEDA added pods. Latency dropped back under the target. This was clear proof that scaling was tied to what users feel.


CPU vs GPU vs Autoscaling: Trade-offs in Real-Time Scaling

During testing, I compared three setups side by side — CPU-only, GPU-enabled, and Autoscaling (using KEDA with Prometheus).

Setup Description Performance Cost Best Use Case
CPU-only Default pods on standard GKE nodes Moderate latency (~400 ms under load) Low Small groups or low concurrency
GPU-enabled Uses NVIDIA T4 GPU node pool Stable latency (~150 ms even under load) High Heavy video workloads (encoding, filters, ML)
Autoscaling (KEDA) Scales based on latency or queue metrics Dynamic — grows only when needed Medium Burst traffic, variable loads

The takeaway:

  • CPU-only is predictable but limited.
  • GPU-enabled gives smooth real-time video at higher cost.
  • Autoscaling balances both by using extra pods or GPU nodes only when latency rises.

Lessons Learned

  • NodeSelector problems: Initially, my pods remained pending. The cause was a bad label. The right label is:
Shell
 
cloud.google.com/gke-accelerator: nvidia-tesla-t4


Image shows how the scheduler only places GPU workloads on nodes with the correct accelerator label.

Figure 5: The scheduler only places GPU workloads on nodes with the correct accelerator label.


Figure 5: The scheduler only places GPU workloads on nodes with the correct accelerator label.

Plugin URL: An old NVIDIA plugin link gave a 404. The correct one is:

Shell
 
https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin.yml


  • GPU quota: My project had no GPU quota at first. I had to request quota before the GPU pool worked. Also, GPU quota is regional; check per zone.
  • Costs: GPU nodes are pricey. Safe startup defaults and teardown scripts kept me from running them by accident.
  • Cluster operations overlap: GKE can block new node pool creation while another operation is running. Always check with gcloud container operations list before retrying.
  • Pending Pods Debugging: If pods stay pending, describe the Events: section — it’s often node affinity or GPU scheduling issues.
  • Startup/Teardown Testing: Test teardown thoroughly — a missed load balancer or disk can keep billing active.

Best Practices

Here’s what worked well for me:

  • Scale on latency SLOs, not just CPU.
  • Collect metrics that reflect what users feel: latency, jitter, frame drops.
  • Use Prometheus and Grafana for visibility.
  • Automate infra setup/teardown with scripts.
  • Check GPU quotas before you build.

Conclusion

Scaling WebRTC is not only about adding pods. It is about keeping calls smooth for users.

With Dockerized workloads, GKE, GPU nodes, Prometheus monitoring, and KEDA autoscaling, you can:

  • Scale on latency, not just CPU.
  • Keep user calls stable under load.
  • Control costs with startup and teardown scripts.

Users don’t care about system graphs. They care about video and audio that work. With this setup, scaling aligns with what matters most — the experience on the other end of the call.

WebRTC

Opinions expressed by DZone contributors are their own.

Related

  • Real-Time Communication Protocols: A Developer's Guide With JavaScript
  • Janus vs. MediaSoup: The Ultimate Guide To Choosing Your WebRTC Server
  • The Battle Of New Industry Standards: Is WebRTC Making Zoom Redundant?

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook