Building Green AI: Lessons in GPU Efficiency From the Trenches

Most deep learning models today waste a lot of compute and energy — not because the GPUs are slow, but because we’re not feeding them efficiently.

By Sanjay Kurra · Updated by Srikanta Datta Tumkur · Dec. 03, 25 · Analysis

The Real Problem With Modern Deep Learning

Let’s be honest — we all love scaling. Bigger models, more GPUs, larger clusters. But here’s what we found in production: most GPU time isn’t spent doing useful work.

Even when the utilization graph says “busy,” your GPUs might be sitting idle waiting for data. The issue isn’t the hardware — it’s inefficiency across three fronts:

  1. Compute-bound limits – when your math throughput (FLOPs) hits the ceiling
  2. Memory-bound stalls – when GPUs wait on data transfer
  3. Overhead – when performance evaporates in kernel launches, data shuffling, or CPU-GPU syncs

Over the last few years, we started treating these not as isolated optimization points, but as a systems problem. That mindset shift led to what I now call the Holistic Efficiency-Centric (HEC) framework — a practical way to build “green AI” that’s both fast and resource-conscious.

Designing for Compute-Bound Efficiency

The first principle is simple: keep GPUs doing math, not waiting for memory.

I started focusing on arithmetic intensity — basically, the ratio of useful computation (FLOPs) to data movement (bytes). High intensity means efficient GPU use.

AI (arithmetic intensity) = FLOPs / Bytes


The higher this number, the better your compute utilization. Here’s what worked best in practice:

  • Structured pruning over unstructured. Removing entire neurons, filters, or heads keeps memory access patterns contiguous, which GPUs love.
  • Architectures designed for FLOPs/Byte — think EfficientNet or sparse Mixture-of-Experts (MoE) layers that activate only a subset of experts per input.
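  • Depthwise separable convolutions deserve care: they cut FLOPs and parameters sharply, but the depthwise stage does little math per byte moved, so confirm with a profiler that the savings show up as real speed on your hardware.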

When we profile networks, we often find it’s not compute that’s slow — it’s inefficient memory access. Designing for arithmetic intensity fixes that from the ground up.

In short: optimize architectures for data flow, not just parameter count.
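To make the FLOPs-per-byte idea concrete, here is a minimal sketch that computes arithmetic intensity for a large matrix multiply and for an elementwise add, then checks each against a simple roofline. The peak throughput and bandwidth constants are illustrative assumptions rather than measurements of any particular GPU, and the byte counts assume each operand is read or written exactly once.

```python
# Arithmetic intensity (FLOPs per byte) for two common ops, plus a roofline check.
# The hardware numbers below are illustrative assumptions, not measurements.

PEAK_FLOPS = 312e12      # assumed peak FP16 throughput, FLOP/s
PEAK_BANDWIDTH = 2.0e12  # assumed memory bandwidth, bytes/s
BYTES_PER_ELEM = 2       # FP16


def matmul_intensity(m: int, n: int, k: int) -> float:
    """C[m,n] = A[m,k] @ B[k,n], each matrix read or written once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = BYTES_PER_ELEM * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_add_intensity(num_elems: int) -> float:
    """z = x + y: one FLOP per element, two reads and one write."""
    flops = num_elems
    bytes_moved = BYTES_PER_ELEM * 3 * num_elems
    return flops / bytes_moved


def attainable_flops(intensity: float) -> float:
    """Roofline model: throughput is capped by compute or by bandwidth."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


def bound(intensity: float) -> str:
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the two caps meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


if __name__ == "__main__":
    for name, ai in [
        ("matmul 4096^3", matmul_intensity(4096, 4096, 4096)),
        ("elementwise add", elementwise_add_intensity(1 << 20)),
    ]:
        print(f"{name}: AI={ai:.2f} FLOPs/byte, {bound(ai)}, "
              f"attainable={attainable_flops(ai) / 1e12:.1f} TFLOP/s")
```

Under these assumptions the matmul sits far above the ridge point while the elementwise add sits far below it, which is why fusing small elementwise ops into neighboring kernels tends to pay off.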

Making Memory Your Ally: Quantization and Mixed Precision

Once the model is compute-efficient, the next step is cutting the data transfer bottleneck.

Two techniques that made a noticeable impact in our experiments:

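  • Quantization-aware training (QAT). Instead of training in full precision and hoping post-training quantization works, I now simulate low-precision arithmetic (like INT8) during training, typically by inserting quantize-dequantize steps in the forward pass. This forces the model to adapt to quantization error early, maintaining accuracy while slashing the memory footprint of the deployed model.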
  • Mixed-precision training. Using FP16/FP32 hybrids isn’t just about speedups — it’s energy efficiency, too. On modern NVIDIA GPUs, mixed-precision kernels can double throughput while lowering power draw.

In our experiments, these two changes alone often cut training time by 30–40% and lowered energy use per epoch. That’s tangible sustainability.
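The core operation behind quantization-aware training can be sketched in a few lines. The example below is a simplified, framework-free illustration of symmetric per-tensor INT8 "fake quantization" (quantize, then dequantize); the function names are hypothetical, plain Python lists stand in for tensors, and real QAT toolkits additionally handle gradients, typically with a straight-through estimator.

```python
# Symmetric per-tensor INT8 "fake quantization": quantize, then dequantize.
# QAT inserts this kind of step in the forward pass so the model learns
# under quantization error. Function names here are hypothetical.

QMAX = 127  # symmetric signed 8-bit range [-127, 127]


def compute_scale(values: list[float]) -> float:
    """Scale that maps the largest magnitude onto QMAX."""
    max_abs = max(abs(v) for v in values)
    return max_abs / QMAX if max_abs > 0 else 1.0


def quantize(values: list[float], scale: float) -> list[int]:
    """Round to the nearest integer step and clamp to the INT8 range."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def dequantize(q_values: list[int], scale: float) -> list[float]:
    """Map integer steps back to floating point."""
    return [q * scale for q in q_values]


def fake_quantize(values: list[float]) -> tuple[list[float], float]:
    """Return the values as the quantized model would see them, plus the scale."""
    scale = compute_scale(values)
    return dequantize(quantize(values, scale), scale), scale


if __name__ == "__main__":
    weights = [0.82, -1.27, 0.003, 0.5, -0.91]
    approx, scale = fake_quantize(weights)
    worst = max(abs(a - w) for a, w in zip(approx, weights))
    print(f"scale={scale:.5f}, worst-case error={worst:.5f}")
    print(f"storage: {4 * len(weights)} bytes as FP32 vs {len(weights)} bytes as INT8")
```

The rounding error per value is bounded by half the scale, and storage drops from four bytes to one byte per weight, which is where the memory-traffic savings come from.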

Killing Overhead With Smarter Inference

If training is about efficiency over time, inference is about throughput per watt. The biggest culprit I see is kernel overhead: small operations constantly launching new CUDA kernels. My go-to strategies:

  • Operator fusion: Tools like TensorRT or ONNX Runtime can merge multiple small ops (e.g., conv + activation + norm) into a single kernel. This reduces launch latency and avoids writing intermediate results back to GPU memory.
  • Tensor tiling: Align matrix operations to fit into on-chip shared memory (SRAM). The data stays local longer, cutting HBM traffic.
  • Dynamic batching: For production inference servers, grouping requests dynamically (using Triton or custom batching) increases GPU occupancy without adding much latency.

The difference in throughput can be massive — I have seen up to 3x gains from these optimizations alone.
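The dynamic batching policy described above can be illustrated with a small offline simulation. This is a sketch under stated assumptions, not how Triton implements its batcher: `Request`, `form_batches`, and the parameters `max_batch_size` and `max_wait_s` are hypothetical names, and a real server would close batches with a timer rather than waiting for the next arrival.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    arrival_s: float  # arrival time in seconds


def form_batches(requests, max_batch_size, max_wait_s):
    """Group requests into batches by arrival time.

    A batch opens when its first request arrives. It closes when it is
    full, or when the next request arrives after the wait window expired.
    """
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_s):
        full = len(current) >= max_batch_size
        expired = bool(current) and req.arrival_s - current[0].arrival_s > max_wait_s
        if full or expired:
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0.000, 0.002, 0.004, 0.030, 0.031]
    reqs = [Request(i, t) for i, t in enumerate(arrivals)]
    for batch in form_batches(reqs, max_batch_size=4, max_wait_s=0.010):
        print([r.request_id for r in batch])
```

The two knobs trade latency for occupancy: a longer wait window fills batches more often but delays the first request in each batch by up to that window.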

Measuring What Actually Matters

Now comes the part many teams skip: measuring efficiency beyond accuracy and latency. We learned this lesson the hard way, after weeks of guessing which tweaks mattered. The fix? Real-time GPU telemetry. Our standard MLOps observability stack looks like this:

  • NVIDIA DCGM Exporter – exposes GPU metrics like utilization, memory bandwidth, and power draw.
  • Prometheus – scrapes and stores these metrics at fine intervals.
  • Grafana – dashboards for live and historical visibility.

A typical setup in Kubernetes:

  • Deploy DCGM Exporter as a DaemonSet (so it runs on every GPU node).
  • Configure a ServiceMonitor for Prometheus to scrape metrics.
  • Wire Grafana to visualize trends across clusters.

Deploying the GPU Monitoring Stack

Step 1: DCGM Exporter (GPU Telemetry Source)

The NVIDIA DCGM Exporter is a lightweight agent that publishes GPU metrics like utilization, memory bandwidth, and power usage. In Kubernetes, I deploy it as a DaemonSet, ensuring one exporter per GPU node:

YAML
 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.0-ubuntu20.04
          ports:
            - containerPort: 9400
              name: metrics


Step 2: Prometheus (Metrics Collection)

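Prometheus acts as the time-series database. With the Prometheus Operator installed, a ServiceMonitor discovers and scrapes all DCGM exporters in the cluster. A ServiceMonitor selects Services rather than pods, so this manifest assumes a Service labeled app: dcgm-exporter that exposes the port named metrics: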

YAML
 
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-servicemonitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s


Step 3: Grafana (Visualization)

Grafana connects to Prometheus and makes all of this human-readable. My dashboard tracks these key metrics:

  • GPU utilization (the share of time at least one kernel is running, which is why a “busy” GPU can still be underused)
  • PCIe and NVLink bandwidth (data movement rate)
  • SM occupancy and memory copy utilization (overhead)
  • Power usage (energy cost per model)

Here are a few PromQL snippets I rely on every day:

PromQL
 
# Metric names depend on the exporter version and its counters file; recent
# dcgm-exporter releases use names such as DCGM_FI_DEV_GPU_UTIL.

# Average GPU utilization (Compute-Bound Indicator)
avg by (node, gpu) (dcgm_gpu_utilization)

# PCIe data transmitted per second (MiB/s)
rate(dcgm_pcie_tx_bytes[1m]) / 1024 / 1024

# NVLink throughput (GiB/s)
rate(dcgm_nvlink_tx_bytes[1m]) / 1024 / 1024 / 1024

# Average power usage (Watts)
avg by (node, gpu) (dcgm_gpu_power_usage)
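Queries like these can also be run outside Grafana, for example from a training script or a CI check. The sketch below builds a URL for Prometheus’s instant-query HTTP endpoint and parses the JSON it returns. The base URL `http://prometheus:9090`, the helper names, and the canned response are assumptions for illustration; the metric name and labels are taken from the queries above and may differ in your deployment.

```python
import json
from urllib.parse import urlencode

# A canned response in the shape Prometheus returns for an instant query.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"node": "gpu-node-1", "gpu": "0"},
             "value": [1700000000.0, "87.5"]},
            {"metric": {"node": "gpu-node-1", "gpu": "1"},
             "value": [1700000000.0, "42.0"]},
        ],
    },
})


def build_query_url(base_url: str, promql: str) -> str:
    """URL for the instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: str) -> dict:
    """Map (node, gpu) -> float from an instant-vector response body."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    samples = {}
    for series in body["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # the sample value arrives as a string
        samples[(labels.get("node", "?"), labels.get("gpu", "?"))] = float(value)
    return samples


if __name__ == "__main__":
    # Hypothetical in-cluster address; fetch this URL with any HTTP client.
    print(build_query_url("http://prometheus:9090",
                          "avg by (node, gpu) (dcgm_gpu_utilization)"))
    for (node, gpu), util in parse_instant_vector(SAMPLE_RESPONSE).items():
        print(f"{node} GPU {gpu}: {util:.1f}%")
```

Keeping URL construction and parsing separate from the network call makes the logic easy to test against recorded responses.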


Sample Grafana dashboard JSON to import for visualization: 

JSON
 
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 1,
  "id": 1,
  "links": [],
  "panels": [
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "max": 100,
          "min": 0,
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_utilization)",
          "hide": false,
          "interval": "",
          "legendFormat": "{{node}} (GPU {{gpu}})",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Utilization (%) - Compute-Bound Indicator",
      "type": "gauge"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "max": 100,
          "min": 0,
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 8,
        "y": 0
      },
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_mem_copy_util)",
          "hide": false,
          "interval": "",
          "legendFormat": "{{node}} (GPU {{gpu}})",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Memory Copy Utilization (%)",
      "type": "gauge"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "max": 100,
          "min": 0,
          "unit": "watt"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 0
      },
      "id": 6,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_power_usage)",
          "hide": false,
          "interval": "",
          "legendFormat": "{{node}} (GPU {{gpu}})",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Power Usage (Watts)",
      "type": "gauge"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 8
      },
      "id": 8,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right"
        },
        "tooltip": {
          "mode": "multi-time"
        }
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_utilization)",
          "legendFormat": "GPU {{gpu}} - {{node}}",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Utilization Over Time",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "unit": "bytes"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 17
      },
      "id": 10,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right"
        },
        "tooltip": {
          "mode": "multi-time"
        }
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_dram_read_util) + avg by (node, gpu) (dcgm_gpu_dram_write_util)",
          "legendFormat": "DRAM Utilization {{gpu}} - {{node}}",
          "promql": true,
          "refId": "A"
        },
        {
          "exemplar": false,
          "expr": "rate(dcgm_pcie_tx_bytes[1m])",
          "legendFormat": "PCIe TX {{gpu}} - {{node}}",
          "promql": true,
          "refId": "B"
        },
        {
          "exemplar": false,
          "expr": "rate(dcgm_nvlink_tx_bytes[1m])",
          "legendFormat": "NVLink TX {{gpu}} - {{node}}",
          "promql": true,
          "refId": "C"
        }
      ],
      "title": "Memory/Interconnect Bandwidth (Bytes/sec)",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "unit": "watt"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 26
      },
      "id": 12,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right"
        },
        "tooltip": {
          "mode": "multi-time"
        }
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_power_usage)",
          "legendFormat": "GPU Power {{gpu}} - {{node}}",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Power Usage Over Time",
      "type": "timeseries"
    }
  ],
  "refresh": "5s",
  "schemaVersion": 27,
  "style": "dark",
  "tags": [
    "gpu",
    "nvidia",
    "mlops"
  ],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-30m",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "",
  "title": "DCGM GPU Monitoring Dashboard",
  "uid": "dcgm-gpu-monitoring-1",
  "version": 1
}


These metrics expose patterns that aggregate numbers hide. For instance:

  • High GPU utilization + high power = healthy, compute-bound.
  • Low GPU utilization + high PCIe bandwidth = memory-bound.
  • Low SM occupancy = overhead (too many small kernels).
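These three rules of thumb can be encoded as a small triage function. This is only a sketch: the `GpuSample` fields, the thresholds, and the label names are hypothetical and would need tuning against your own hardware and workloads.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_util_pct: float      # 0-100
    power_frac: float        # power draw / power limit, 0-1
    pcie_frac: float         # PCIe throughput / link capacity, 0-1
    sm_occupancy_pct: float  # 0-100


# Hypothetical thresholds; tune them for your own fleet.
HIGH_UTIL = 80.0
LOW_UTIL = 50.0
HIGH_POWER = 0.8
HIGH_PCIE = 0.6
LOW_OCCUPANCY = 30.0


def classify(sample: GpuSample) -> str:
    """Apply the three heuristics in order and return a coarse label."""
    if sample.gpu_util_pct >= HIGH_UTIL and sample.power_frac >= HIGH_POWER:
        return "compute-bound"
    if sample.gpu_util_pct <= LOW_UTIL and sample.pcie_frac >= HIGH_PCIE:
        return "memory-bound"
    if sample.sm_occupancy_pct <= LOW_OCCUPANCY:
        return "overhead-bound"
    return "unclear"


if __name__ == "__main__":
    print(classify(GpuSample(95, 0.9, 0.1, 60)))
    print(classify(GpuSample(30, 0.4, 0.8, 60)))
    print(classify(GpuSample(60, 0.5, 0.2, 15)))
```

An "unclear" result is a cue to move from dashboards to a profiler rather than a verdict in itself.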

Once you can see it, you can fix it.

When Dashboards Aren’t Enough: Deep Profiling

Metrics are great, but when something still feels “off,” I dig deeper using NVIDIA’s profiling tools:

  • Nsight Systems: shows full workload timelines, including kernel launches, memory transfers, and CPU-GPU syncs.
  • Nsight Compute: dives into per-kernel metrics like cache hits, warp occupancy, and memory efficiency.

One lesson here: don’t rely on single snapshots. Run profiling with your real batch sizes and data pipelines. Bottlenecks that don’t show up in synthetic tests often explode under production load.

Lessons Learned: Efficiency Is an Engineering Discipline

Here’s what we have learned after multiple iterations of tuning, profiling, and staring at Grafana dashboards at 2 a.m.:

  1. Don’t chase FLOPs — chase utilization. Most “slow models” aren’t compute-limited; they’re starved for data or blocked by launch overhead.
  2. Think system-wide. Model design, kernel fusion, quantization — they all interact. Treat optimization as a pipeline, not a set of toggles.
  3. Make observability part of training. GPU telemetry should be standard in every ML stack, not an afterthought.
  4. Measure energy, not just speed. A 10% slower model that uses 30% less power is often the better production choice.
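The arithmetic behind the last point is worth spelling out: energy is average power multiplied by time, so the two ratios multiply. A minimal sketch, with hypothetical helper names and an illustrative 300 W figure:

```python
def relative_energy(time_ratio: float, power_ratio: float) -> float:
    """Energy = average power x time, so the ratios multiply."""
    return time_ratio * power_ratio


def energy_wh(avg_power_w: float, seconds: float) -> float:
    """Energy in watt-hours from average power draw and duration."""
    return avg_power_w * seconds / 3600.0


if __name__ == "__main__":
    # A model that runs 10% slower while drawing 30% less power.
    ratio = relative_energy(time_ratio=1.10, power_ratio=0.70)
    print(f"energy vs. baseline: {ratio:.2f}x ({(1 - ratio) * 100:.0f}% less)")
    # One hour at an illustrative 300 W average draw.
    print(f"{energy_wh(300.0, 3600.0):.0f} Wh")
```

In that example the slower model uses about 0.77 times the baseline energy, a saving of roughly 23% per run.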

The Future: Self-Tuning AI Pipelines

Right now, efficiency tuning is still mostly manual — a mix of profiling, trial, and “engineer instinct.” But we can see the future moving toward adaptive systems that self-optimize based on telemetry.

Imagine your training cluster dynamically adjusting precision, batch size, or kernel fusion based on live GPU metrics — not static configs. Tools such as DCGM, Nsight, and Triton Inference Server already expose the data. It’s just a matter of closing the loop.
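As a thought experiment, one step of such a feedback loop might look like the sketch below. Everything here is hypothetical: the function name, the thresholds, and the doubling and halving policy are illustrative choices, and a real controller would also need smoothing, cooldown periods, and awareness of how batch size affects convergence.

```python
def next_batch_size(current: int, gpu_util_pct: float, mem_used_frac: float,
                    *, min_size: int = 1, max_size: int = 512) -> int:
    """One step of a simple, hypothetical telemetry-driven control policy."""
    if mem_used_frac > 0.90:
        proposed = current // 2  # back off before running out of memory
    elif gpu_util_pct < 70.0:
        proposed = current * 2   # GPU is underfed; try larger batches
    else:
        proposed = current       # inside the target band; hold steady
    return max(min_size, min(max_size, proposed))


if __name__ == "__main__":
    batch = 64
    # Simulated (utilization %, memory fraction) readings from telemetry.
    for util, mem in [(45.0, 0.40), (62.0, 0.70), (88.0, 0.93), (85.0, 0.80)]:
        batch = next_batch_size(batch, util, mem)
        print(f"util={util:.0f}% mem={mem:.0%} -> batch size {batch}")
```

The memory check comes first on purpose: avoiding an out-of-memory failure matters more than squeezing out extra utilization.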

That’s what I mean by “Green AI”: not just fewer parameters, but smarter systems that respect both performance and the planet.

Final Thoughts

When I started working on this, I thought of sustainability as an environmental checkbox. Now I see it as an engineering superpower — efficient systems are faster, cheaper, and greener all at once.

The Holistic Efficiency-Centric mindset changed how I build models: from tuning loss functions to tuning data paths. And it’s deeply satisfying to see a Grafana dashboard where every GPU is humming at 95% utilization — not from brute force, but from good engineering.

If you’re optimizing deep learning workloads, start here:

  • Profile everything
  • Measure utilization and power, not just accuracy
  • Tune your pipeline until your GPUs spend their time computing, not waiting

That’s where Green AI begins — in the trenches, one kernel at a time.
