Building Green AI: Lessons in GPU Efficiency From the Trenches

Most deep learning models today waste a lot of compute and energy — not because the GPUs are slow, but because we’re not feeding them efficiently.

Sanjay Kurra

Updated by

Srikanta Datta Tumkur

Dec. 03, 25 · Analysis

The Real Problem With Modern Deep Learning

Let’s be honest — we all love scaling. Bigger models, more GPUs, larger clusters. But here’s what we found in production: most GPU time isn’t spent doing useful work.

Even when the utilization graph says “busy,” your GPUs might be sitting idle waiting for data. The issue isn’t the hardware — it’s inefficiency across three fronts:

Compute-bound limits – when your math throughput (FLOPs) hits the ceiling
Memory-bound stalls – when GPUs wait on data transfer
Overhead – when performance evaporates in kernel launches, data shuffling, or CPU-GPU syncs

Over the last few years, we started treating these not as isolated optimization points, but as a systems problem. That mindset shift led to what I now call the Holistic Efficiency-Centric (HEC) framework — a practical way to build “green AI” that’s both fast and resource-conscious.

Designing for Compute-Bound Efficiency

The first principle is simple: keep GPUs doing math, not waiting for memory.

I started focusing on arithmetic intensity — basically, the ratio of useful computation (FLOPs) to data movement (bytes). High intensity means efficient GPU use.

    Arithmetic intensity (AI) = FLOPs / Bytes moved

The higher this number, the better your compute utilization. Here’s what worked best in practice:

Structured pruning over unstructured. Removing entire neurons, filters, or heads keeps memory access patterns contiguous, which GPUs love.
Architectures designed for FLOPs/Byte — think EfficientNet or sparse Mixture-of-Experts (MoE) layers that activate only a subset of experts per input.
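Depthwise separable convolutions are underrated: they need far fewer FLOPs and parameters than standard convolutions. The depthwise stage has low arithmetic intensity on its own, though, so they pay off most when it is fused with the pointwise stage.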
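As a rough illustration of the structured pruning point above, the sketch below drops whole output filters with the smallest L1 norm from a weight matrix stored as nested lists. The function name `prune_filters`, the row-per-filter layout, and the L1-norm ranking criterion are assumptions made for illustration, not the method used in the experiments described here.

```python
def prune_filters(weights, keep_ratio):
    """Structured pruning sketch: keep the filters (rows) with the largest L1 norm.

    weights: list of rows, one row of floats per output filter (hypothetical layout).
    keep_ratio: fraction of filters to keep, in (0, 1].
    Returns (pruned_weights, kept_indices); kept rows stay in their original order.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(weights) * keep_ratio))
    # Rank filters by L1 norm (sum of absolute weights), largest first.
    l1_norms = [sum(abs(w) for w in row) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: l1_norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    # Whole rows are removed, so the remaining matrix stays dense and contiguous.
    return [weights[i] for i in kept], kept


layer = [
    [0.9, -0.8, 0.7],     # strong filter
    [0.01, 0.02, -0.01],  # near-zero filter
    [0.5, 0.4, -0.6],     # medium filter
    [0.0, 0.03, 0.0],     # near-zero filter
]
pruned, kept = prune_filters(layer, keep_ratio=0.5)
print(kept)  # indices of the surviving filters
```

Because entire rows disappear, the pruned layer is simply a smaller dense matrix. Unstructured pruning would instead leave scattered zeros that still occupy memory unless a sparse format is used.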

When we profile networks, we often find it’s not compute that’s slow — it’s inefficient memory access. Designing for arithmetic intensity fixes that from the ground up.

In short: optimize architectures for data flow, not just parameter count.
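To make the FLOPs-per-byte ratio concrete, here is a minimal sketch that estimates arithmetic intensity for a matrix multiply. It assumes 2-byte (FP16) elements and that each matrix crosses the memory bus exactly once, which is an idealization; the function name is hypothetical.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes A and B are each read once and C is written once
    (an idealized lower bound on memory traffic).
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


large_batch = matmul_arithmetic_intensity(4096, 4096, 4096)
single_row = matmul_arithmetic_intensity(1, 4096, 4096)
print(f"4096x4096x4096 matmul: {large_batch:.1f} FLOPs/byte")
print(f"1x4096x4096 matmul:    {single_row:.2f} FLOPs/byte")
```

Under these assumptions the large square multiply lands above 1,000 FLOPs per byte, while the single-row case (think batch size 1) sits at roughly 1 FLOP per byte. The same weights are loaded in both cases, but the second amortizes them over almost no math, which is why it tends to be memory-bound.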

Making Memory Your Ally: Quantization and Mixed Precision

Once the model is compute-efficient, the next step is cutting the data transfer bottleneck.

Two techniques that made a noticeable impact in our experiments:

Quantization-aware training (QAT). Instead of training in full precision and hoping post-training quantization works, I now simulate lower precision (like INT8) during training. This forces the model to adapt to quantization error early, maintaining accuracy while slashing memory use at inference time.
Mixed-precision training. Using FP16/FP32 hybrids isn’t just about speedups — it’s energy efficiency, too. On modern NVIDIA GPUs, mixed-precision kernels can double throughput while lowering power draw.
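The core trick behind QAT can be sketched in a few lines: values are quantized and immediately dequantized in the forward pass, so the loss already sees the rounding error. The version below is a simplified, framework-free illustration that assumes symmetric per-tensor INT8 quantization; `fake_quantize` is a hypothetical helper, not an API from any particular library.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to signed integers and immediately dequantize (symmetric, per-tensor).

    Returns (dequantized_values, scale). The dequantized values are what a
    QAT forward pass would compute with instead of the original floats.
    """
    qmax = 2 ** (num_bits - 1) - 1        # 127 for INT8
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values), 1.0
    scale = max_abs / qmax                # real-valued size of one integer step
    dequantized = []
    for v in values:
        q = round(v / scale)              # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized, scale


weights = [0.731, -1.204, 0.058, 0.0004]
approx, scale = fake_quantize(weights)
for w, a in zip(weights, approx):
    print(f"{w:+.4f} -> {a:+.4f} (error {abs(w - a):.5f})")
```

Each value moves by at most half a quantization step, and very small weights collapse to zero. Training with this error in the loop gives the model a chance to compensate, and the stored INT8 weights need one byte each instead of four for FP32.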

These tweaks alone often reduced training time by 30–40% while lowering the energy consumed per epoch. That’s tangible sustainability.

Killing Overhead With Smarter Inference

If training is about efficiency over time, inference is about throughput per watt. The biggest culprit I see is kernel overhead: small operations constantly launching new CUDA kernels. My go-to strategies:

Operator fusion: Tools like TensorRT or ONNX Runtime can merge multiple small ops (e.g., conv + norm + activation) into a single kernel. This reduces launch latency and round trips to GPU memory.
Tensor tiling: Align matrix operations to fit into shared GPU memory (SRAM). The data stays local longer, cutting HBM traffic.
Dynamic batching: For production inference servers, grouping requests dynamically (using Triton or custom batching) increases GPU occupancy without adding much latency.
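The dynamic batching idea can be sketched as a small offline simulation: requests are grouped until either a size cap or a wait-time budget is hit. This is a simplified illustration, not how Triton's scheduler is implemented; the function name, parameters, and timings are assumptions.

```python
def form_batches(arrivals, max_batch_size, max_wait_s):
    """Group requests into batches by size cap and wait-time budget.

    arrivals: list of (arrival_time_s, request_id), sorted by arrival time.
    A batch closes when it is full, or when the next request arrives after
    the deadline set by the batch's first request. (A real server would use
    a timer instead of waiting for the next arrival.)
    """
    batches, current, deadline = [], [], None
    for arrival_time, request_id in arrivals:
        if current and arrival_time > deadline:
            batches.append(current)       # wait budget exhausted: ship what we have
            current = []
        if not current:
            deadline = arrival_time + max_wait_s
        current.append(request_id)
        if len(current) == max_batch_size:
            batches.append(current)       # batch is full: ship immediately
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [(0.000, "a"), (0.001, "b"), (0.002, "c"), (0.020, "d"), (0.021, "e")]
batches_wide = form_batches(arrivals, max_batch_size=4, max_wait_s=0.005)
batches_narrow = form_batches(arrivals, max_batch_size=2, max_wait_s=0.005)
print(batches_wide)
print(batches_narrow)
```

The wait budget is the knob that trades latency for occupancy: a few milliseconds is often enough to turn a burst of single requests into one larger kernel launch.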

The difference in throughput can be massive — I have seen up to 3x gains from these optimizations alone.

Measuring What Actually Matters

Now comes the part many teams skip: measuring efficiency beyond accuracy and latency. I learned this lesson the hard way, after weeks of guessing which tweaks mattered. The fix? Real-time GPU telemetry. Our standard MLOps observability stack looks like this:

NVIDIA DCGM Exporter – exposes GPU metrics like utilization, memory bandwidth, and power draw.
Prometheus – scrapes and stores these metrics at fine intervals.
Grafana – dashboards for live and historical visibility.

A typical setup in Kubernetes:

Deploy DCGM Exporter as a DaemonSet (so it runs on every GPU node).
Configure a ServiceMonitor for Prometheus to scrape metrics.
Wire Grafana to visualize trends across clusters.

1. Deploying the GPU Monitoring Stack

Step 1: DCGM Exporter (GPU Telemetry Source)

The NVIDIA DCGM Exporter is a lightweight agent that publishes GPU metrics like utilization, memory bandwidth, and power usage. In Kubernetes, I deploy it as a DaemonSet, ensuring one exporter per GPU node:

    YAML

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.0-ubuntu20.04
          ports:
            - containerPort: 9400
              name: metrics

  

Step 2: Prometheus (Metrics Collection)

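Prometheus acts as the time-series database. A ServiceMonitor (a Prometheus Operator resource) automatically finds and scrapes all DCGM exporters in the cluster. It discovers Services by label, so the exporter pods also need a Service labeled app: dcgm-exporter that exposes the port named metrics: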

    YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-servicemonitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

  

Step 3: Grafana (Visualization)

Grafana connects to Prometheus and makes all of this human-readable. My dashboard tracks these key metrics:

GPU utilization (how busy the SMs are)
PCIe and NVLink bandwidth (data movement rate)
SM occupancy and memory copy utilization (overhead)
Power usage (energy cost per model)

Here are a few PromQL snippets I rely on every day:

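    PromQL

# Metric names below follow this article's setup; they depend on the exporter's
# counter configuration (recent dcgm-exporter releases default to DCGM_FI_* names).

# Average GPU utilization (Compute-Bound Indicator)
avg by (node, gpu) (dcgm_gpu_utilization)

# PCIe data transmitted per second (MiB/s)
rate(dcgm_pcie_tx_bytes[1m]) / (1024 * 1024)

# NVLink throughput (GiB/s)
rate(dcgm_nvlink_tx_bytes[1m]) / (1024 * 1024 * 1024)

# Average power usage (Watts)
avg by (node, gpu) (dcgm_gpu_power_usage)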
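The same queries can be run outside Grafana through Prometheus's HTTP API, which is handy for scripted checks. The sketch below only builds the instant-query URL and parses a response of the documented shape; it makes no network call. The Prometheus address and the sample payload are hypothetical, and the `node` and `gpu` label names are taken from the queries above.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url, promql):
    """Build a URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Turn an instant-vector response into {(node, gpu): float_value}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # the sample value arrives as a string
        samples[(labels.get("node"), labels.get("gpu"))] = float(value)
    return samples


url = build_instant_query_url(
    "http://prometheus.monitoring.svc:9090",  # hypothetical in-cluster address
    "avg by (node, gpu) (dcgm_gpu_utilization)",
)

# Hand-written payload in the shape Prometheus returns, for illustration only.
example_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"node": "gpu-node-1", "gpu": "0"}, "value": [1700000000.0, "87.5"]},
            {"metric": {"node": "gpu-node-1", "gpu": "1"}, "value": [1700000000.0, "12"]},
        ],
    },
}
utilization = parse_instant_vector(example_payload)
print(url)
print(utilization)
```

In a real script, the URL would be fetched with an HTTP client and the decoded JSON body passed to `parse_instant_vector`.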

Sample Grafana dashboard JSON to import for visualization:

    JSON

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 1,
  "id": 1,
  "links": [],
  "panels": [
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "max": 100,
          "min": 0,
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_utilization)",
          "hide": false,
          "interval": "",
          "legendFormat": "{{node}} (GPU {{gpu}})",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Utilization (%) - Compute-Bound Indicator",
      "type": "gauge"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "max": 100,
          "min": 0,
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 8,
        "y": 0
      },
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_mem_copy_util)",
          "hide": false,
          "interval": "",
          "legendFormat": "{{node}} (GPU {{gpu}})",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Memory Copy Utilization (%)",
      "type": "gauge"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "min": 0,
          "unit": "watt"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 0
      },
      "id": 6,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_power_usage)",
          "hide": false,
          "interval": "",
          "legendFormat": "{{node}} (GPU {{gpu}})",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Power Usage (Watts)",
      "type": "gauge"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 8
      },
      "id": 8,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right"
        },
        "tooltip": {
          "mode": "multi-time"
        }
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_utilization)",
          "legendFormat": "GPU {{gpu}} - {{node}}",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Utilization Over Time",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "unit": "Bps"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 17
      },
      "id": 10,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right"
        },
        "tooltip": {
          "mode": "multi-time"
        }
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_dram_read_util) + avg by (node, gpu) (dcgm_gpu_dram_write_util)",
          "legendFormat": "DRAM Utilization {{gpu}} - {{node}}",
          "promql": true,
          "refId": "A"
        },
        {
          "exemplar": false,
          "expr": "rate(dcgm_pcie_tx_bytes[1m])",
          "legendFormat": "PCIe TX {{gpu}} - {{node}}",
          "promql": true,
          "refId": "B"
        },
        {
          "exemplar": false,
          "expr": "rate(dcgm_nvlink_tx_bytes[1m])",
          "legendFormat": "NVLink TX {{gpu}} - {{node}}",
          "promql": true,
          "refId": "C"
        }
      ],
      "title": "Memory/Interconnect Bandwidth (Bytes/sec)",
      "type": "timeseries"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "unit": "watt"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 26
      },
      "id": 12,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "right"
        },
        "tooltip": {
          "mode": "multi-time"
        }
      },
      "targets": [
        {
          "exemplar": false,
          "expr": "avg by (node, gpu) (dcgm_gpu_power_usage)",
          "legendFormat": "GPU Power {{gpu}} - {{node}}",
          "promql": true,
          "refId": "A"
        }
      ],
      "title": "GPU Power Usage Over Time",
      "type": "timeseries"
    }
  ],
  "refresh": "5s",
  "schemaVersion": 27,
  "style": "dark",
  "tags": [
    "gpu",
    "nvidia",
    "mlops"
  ],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-30m",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "",
  "title": "DCGM GPU Monitoring Dashboard",
  "uid": "dcgm-gpu-monitoring-1",
  "version": 1
}

  

These metrics expose patterns that aggregate numbers hide. For instance:

High GPU utilization + high power = healthy, compute-bound.
Low GPU utilization + high PCIe bandwidth = memory-bound.
Low SM occupancy = overhead (too many small kernels).
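The three patterns above can be turned into a first-pass triage rule. The sketch below is one possible encoding; every threshold is an assumption chosen for illustration and should be tuned per GPU model and workload, and the PCIe and power inputs are assumed to be expressed as a percentage of link peak and power limit.

```python
def classify_bottleneck(gpu_util_pct, pcie_util_pct, sm_occupancy_pct, power_pct_of_limit):
    """Map dashboard readings to one of the three regimes, or 'inconclusive'.

    All inputs are percentages (0-100). Thresholds are illustrative assumptions.
    """
    if gpu_util_pct > 80 and power_pct_of_limit > 70:
        return "compute-bound"    # busy and drawing power: healthy
    if gpu_util_pct < 50 and pcie_util_pct > 70:
        return "memory-bound"     # GPU waits while data crosses the bus
    if sm_occupancy_pct < 30:
        return "overhead-bound"   # too many small kernels, SMs mostly empty
    return "inconclusive"         # time to open a profiler


print(classify_bottleneck(95, 20, 60, 90))
print(classify_bottleneck(30, 85, 50, 40))
print(classify_bottleneck(60, 10, 15, 50))
```

A rule like this is only a starting point for triage; anything it labels "inconclusive" is a candidate for the deeper profiling described next.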

Once you can see it, you can fix it.

When Dashboards Aren’t Enough: Deep Profiling

Metrics are great, but when something still feels “off,” I dig deeper using NVIDIA’s profiling tools:

Nsight Systems: shows full workload timelines, including kernel launches, memory transfers, and CPU-GPU syncs.
Nsight Compute: dives into per-kernel metrics like cache hit rates, warp occupancy, and memory efficiency.

One lesson here: don’t rely on single snapshots. Run profiling with your real batch sizes and data pipelines. Bottlenecks that don’t show up in synthetic tests often explode under production load.

Lessons Learned: Efficiency Is an Engineering Discipline

Here’s what we have learned after multiple iterations of tuning, profiling, and staring at Grafana dashboards at 2 a.m.:

Don’t chase FLOPs — chase utilization. Most “slow models” aren’t compute-limited; they’re starved for data or blocked by launch overhead.
Think system-wide. Model design, kernel fusion, quantization — they all interact. Treat optimization as a pipeline, not a set of toggles.
Make observability part of training. GPU telemetry should be standard in every ML stack, not an afterthought.
Measure energy, not just speed. A 10% slower model that uses 30% less power is often the better production choice.
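The arithmetic behind the last point is worth spelling out, since energy is average power multiplied by time. The sketch below assumes "10% slower" means a 10% longer runtime and uses a hypothetical 300 W, 100-second baseline; only the ratios matter.

```python
def energy_joules(avg_power_watts, duration_s):
    """Energy consumed at a constant average power over a duration."""
    return avg_power_watts * duration_s


# Hypothetical baseline: 300 W for 100 s.
baseline = energy_joules(avg_power_watts=300.0, duration_s=100.0)
# Same job, 30% lower average power but 10% longer runtime.
slower_but_leaner = energy_joules(avg_power_watts=300.0 * 0.70, duration_s=100.0 * 1.10)

saving = 1 - slower_but_leaner / baseline
print(f"energy saving: {saving:.0%}")
```

Because 0.70 × 1.10 = 0.77, the slower configuration still uses about 23% less energy per job, which is why power readings alone can mislead without the runtime next to them.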

The Future: Self-Tuning AI Pipelines

Right now, efficiency tuning is still mostly manual — a mix of profiling, trial, and “engineer instinct.” But we can see the future moving toward adaptive systems that self-optimize based on telemetry.

Imagine your training cluster dynamically adjusting precision, batch size, or kernel fusion based on live GPU metrics — not static configs. Tools such as DCGM, Nsight, and Triton Inference Server already expose the data. It’s just a matter of closing the loop.
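As a toy picture of what closing the loop could look like, the sketch below adjusts batch size from two telemetry readings. It is a deliberately naive rule with assumed thresholds and a hypothetical function name, not an existing feature of DCGM, Nsight, or Triton.

```python
def next_batch_size(current, gpu_util_pct, mem_used_frac, max_batch=512):
    """One step of a hypothetical telemetry-driven batch-size rule.

    current: batch size in use now.
    gpu_util_pct: GPU utilization, 0-100.
    mem_used_frac: fraction of GPU memory in use, 0.0-1.0.
    Thresholds are illustrative assumptions.
    """
    if mem_used_frac > 0.90:
        return max(1, current // 2)         # back off before running out of memory
    if gpu_util_pct < 70 and mem_used_frac < 0.60:
        return min(max_batch, current * 2)  # GPU is underfed and has headroom
    return current                          # within the target band: leave it alone


print(next_batch_size(32, gpu_util_pct=40, mem_used_frac=0.30))
print(next_batch_size(32, gpu_util_pct=95, mem_used_frac=0.95))
print(next_batch_size(32, gpu_util_pct=90, mem_used_frac=0.70))
```

For inference this kind of rule is fairly safe to experiment with; for training, changing the batch size also changes optimization dynamics, so a real controller would need to adjust the learning rate or restrict itself to knobs that do not affect convergence.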

That’s what I mean by “Green AI”: not just fewer parameters, but smarter systems that respect both performance and the planet.

Final Thoughts

When I started working on this, I thought of sustainability as an environmental checkbox. Now I see it as an engineering superpower — efficient systems are faster, cheaper, and greener all at once.

The Holistic Efficiency-Centric mindset changed how I build models: from tuning loss functions to tuning data paths. And it’s deeply satisfying to see a Grafana dashboard where every GPU is humming at 95% utilization — not from brute force, but from good engineering.

If you’re optimizing deep learning workloads, start here:

Profile everything
Measure utilization and power, not just accuracy
Tune your pipeline until your GPUs spend their time computing, not waiting

That’s where Green AI begins — in the trenches, one kernel at a time.

Opinions expressed by DZone contributors are their own.
