Last month I was helping a friend debug a transformer inference service that was somehow slower on an A100 than on his M2 MacBook for small batch sizes. The GPU utilization graph showed 30% usage. Thirty percent. On hardware that costs more per hour than my car payment.
If you've ever stared at nvidia-smi watching compute usage hover at 20% while your latency tanks, you've hit the same wall. And it's not your code being bad — it's that you're memory-bound, not compute-bound. Memory is where most ML workloads actually die now, and most of us were taught to optimize the wrong thing.
The frustrating symptom
Here's what it usually looks like. You profile your inference loop, see something like this, and immediately blame the model:

    # A typical 'why is this slow' profile
    import time

    import torch

    model = load_model().cuda().eval()  # load_model() stands in for your own loader
    x = torch.randn(1, 512, 768, device='cuda')

    # Warmup
    for _ in range(10):
        _ = model(x)
    torch.cuda.synchronize()

    n_iters = 100
    start = time.perf_counter()
    for _ in range(n_iters):
        _ = model(x)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    print(f'{elapsed / n_iters * 1e3:.2f}ms per call')

You get, say, 18ms. You check nvidia-smi and GPU utilization is sitting at 25%. You add torch.compile. Maybe it shaves a millisecond. You try fp16. Same story. You start wondering if you bought a defective GPU.
You didn't. The compute is fine. The bottleneck moved.
Root cause: the arithmetic intensity wall
Every operation in your model has an arithmetic intensity: the number of FLOPs performed per byte of memory moved. A matmul on a big square matrix has high intensity (lots of compute per byte). A LayerNorm or a residual add has terrible intensity (well under one FLOP per byte).
Here's the part nobody told me until I read the NVIDIA performance docs: modern GPUs have way more compute than memory bandwidth. An H100 does about 67 TFLOPS of FP32 but only about 3 TB/s of HBM bandwidth. That's a ratio of ~22 FLOPs per byte. If your op does fewer than 22 FLOPs per byte loaded, you are memory-bound, and adding more compute does literally nothing. With FP16/BF16 tensor cores the compute peak is roughly an order of magnitude higher still, so the bar for staying compute-bound only goes up.
And this gap keeps widening. Memory hasn't been able to keep up with compute scaling for years now, which is why HBM stacks have eaten a larger and larger share of the bill of materials on AI accelerators. From a developer perspective, this means: small batch sizes, long sequences with KV caches, and elementwise ops are where you bleed.
The fix isn't 'use a smaller model.' The fix is making each byte you load do more work.
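To make that threshold concrete, here's a rough back-of-the-envelope sketch in plain Python. It uses the 67 TFLOPS and 3 TB/s figures quoted above; the function names and the fp16 default (2 bytes per element) are my own assumptions, and the byte counts ignore caches entirely.

```python
# Back-of-the-envelope roofline check. Peak numbers are the ones quoted above;
# everything else (names, fp16 default) is an illustrative assumption.
PEAK_FLOPS = 67e12      # FLOP/s
PEAK_BANDWIDTH = 3e12   # bytes/s
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH  # ~22 FLOPs per byte


def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, ignoring cache reuse."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved


def add_intensity(bytes_per_elem=2):
    """FLOPs per byte for an elementwise add: read two values, write one."""
    return 1 / (3 * bytes_per_elem)


def is_memory_bound(intensity, ridge=RIDGE_POINT):
    return intensity < ridge


for name, value in [
    ("4096x4096 @ 4096x4096", matmul_intensity(4096, 4096, 4096)),
    ("1x4096 @ 4096x4096 (batch 1)", matmul_intensity(1, 4096, 4096)),
    ("residual add", add_intensity()),
]:
    verdict = "memory-bound" if is_memory_bound(value) else "compute-bound"
    print(f"{name}: {value:.2f} FLOPs/byte -> {verdict}")
```

The batch-1 matmul lands at about one FLOP per byte, which is why small batch sizes hurt even when the model is mostly matmuls.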
Step 1: Confirm you're actually memory-bound
Before optimizing anything, prove the diagnosis. Use the PyTorch profiler with input shapes recorded:

    from torch.profiler import profile, ProfilerActivity

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        for _ in range(20):
            _ = model(x)

    # Sort by self CUDA time, i.e. the actual GPU work
    print(prof.key_averages().table(
        sort_by='self_cuda_time_total',
        row_limit=15,
    ))

Look at what's eating your time. If the top entries are aten::add, aten::mul, aten::layer_norm, or any flavor of pointwise op, congratulations, you're memory-bound. If they're aten::mm or aten::bmm, you're probably compute-bound and this article won't help much (though at batch size 1 even a matmul can end up limited by how fast the weights load).
For a deeper view, Nsight Systems (nsys) will show you the kernel launch overhead too, which on small batches is often a huge chunk of wall time. Tiny kernels mean lots of launches and lots of round trips to global memory.
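If you'd rather have a number than eyeball the table, here's a small sketch that buckets op names and reports each group's share of GPU time. It works on (op name, self CUDA time) pairs copied out of the profiler table by hand; the groupings, the function name, and the example timings are all made up for illustration.

```python
# Illustrative op groupings; extend them for the ops your model actually uses.
BANDWIDTH_HEAVY_OPS = {"aten::add", "aten::mul", "aten::layer_norm", "aten::gelu"}
MATMUL_OPS = {"aten::mm", "aten::bmm", "aten::addmm"}


def time_shares(rows):
    """rows: iterable of (op_name, self_cuda_time_us). Returns each group's fraction of time."""
    totals = {"bandwidth_heavy": 0.0, "matmul": 0.0, "other": 0.0}
    for name, micros in rows:
        if name in BANDWIDTH_HEAVY_OPS:
            totals["bandwidth_heavy"] += micros
        elif name in MATMUL_OPS:
            totals["matmul"] += micros
        else:
            totals["other"] += micros
    grand_total = sum(totals.values())
    if grand_total == 0:
        return totals  # nothing recorded; avoid dividing by zero
    return {group: t / grand_total for group, t in totals.items()}


# Made-up numbers, purely to show the shape of the input.
example_rows = [
    ("aten::add", 300.0),
    ("aten::layer_norm", 200.0),
    ("aten::mm", 400.0),
    ("aten::copy_", 100.0),
]
print(time_shares(example_rows))
```

A big "bandwidth_heavy" share is the signal that the steps below are worth your time.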
Step 2: Fuse your kernels
The single biggest win for memory-bound workloads is kernel fusion. Every separate kernel reads its inputs from HBM and writes outputs back to HBM. Five elementwise ops in a row means five round trips. Fuse them into one kernel and you do one round trip.
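The round-trip arithmetic is easy to sketch. This is a simplified model I'm assuming for illustration: each unfused elementwise kernel reads the tensor once and writes it once, the fused kernel does that exactly once in total, and caches are ignored. The function names are mine.

```python
def unfused_bytes(num_elems, num_ops, bytes_per_elem=2):
    """Bytes moved when every elementwise op is its own kernel (one read + one write each)."""
    return num_ops * 2 * num_elems * bytes_per_elem


def fused_bytes(num_elems, bytes_per_elem=2):
    """Bytes moved when the whole chain runs in one kernel (one read + one write total)."""
    return 2 * num_elems * bytes_per_elem


# Same shape as the (1, 512, 768) input from the profiling example, in fp16.
num_elems = 1 * 512 * 768
before = unfused_bytes(num_elems, num_ops=5)
after = fused_bytes(num_elems)
print(f"unfused: {before} bytes, fused: {after} bytes, ratio: {before / after:.0f}x")
```

Under this model the savings scale linearly with the length of the elementwise chain, which is why long chains of small ops are the first thing to look for.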
torch.compile does a lot of this automatically now, but you have to actually let it. The common mistake is recompiling on every call because shapes change:

    import torch
    import torch.nn.functional as F

    # Bad: every new input shape can trigger a recompile
    compiled = torch.compile(model)
    for batch in dataloader:
        out = compiled(batch)  # shape changes -> recompile churn

    # Better: compile with dynamic shapes so one graph covers many lengths
    compiled = torch.compile(model, dynamic=True)

    # Or pad inputs to a small set of bucket sizes
    def pad_to_bucket(x, buckets=(128, 256, 512, 1024)):
        seq_len = x.shape[1]
        # Smallest bucket that fits; None if the input is longer than every bucket
        target = next((b for b in buckets if b >= seq_len), None)
        if target is None:
            raise ValueError(f'sequence length {seq_len} exceeds largest bucket {buckets[-1]}')
        # x is (batch, seq, hidden): pad the end of the sequence dim only
        return F.pad(x, (0, 0, 0, target - seq_len)), seq_len

I ran into this exact issue on a serving workload where each request had a different sequence length, so we were paying compile cost on every single request. Bucketing input shapes cut p99 latency roughly in half. Not because the model got faster, but because we stopped recompiling.
For attention specifically, use FlashAttention or its variants. It's the canonical example of a memory-bound op being rescued by tiling and recomputation — it never materializes the full attention matrix in HBM. Same math, dramatically less memory traffic.
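To get a feel for what "never materializes the full attention matrix" saves, here's a quick size estimate. The configuration (12 heads of dimension 64, 4096 tokens, fp16) is a hypothetical one I picked for illustration, and the function names are mine.

```python
def attention_scores_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Size of the full (seq_len x seq_len) score matrix across all heads."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


def qkvo_bytes(batch, heads, seq_len, head_dim, bytes_per_elem=2):
    """Combined size of Q, K, V and the output: the tensors attention must touch anyway."""
    return 4 * batch * heads * seq_len * head_dim * bytes_per_elem


MIB = 2 ** 20
scores = attention_scores_bytes(batch=1, heads=12, seq_len=4096)
io = qkvo_bytes(batch=1, heads=12, seq_len=4096, head_dim=64)
print(f"score matrix: {scores / MIB:.0f} MiB, Q/K/V/output: {io / MIB:.0f} MiB")
```

The score matrix grows with the square of the sequence length while the inputs and outputs grow linearly, so the gap widens quickly as sequences get longer.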
Step 3: Shrink what you load
If you can't reduce trips, reduce bytes per trip. The obvious lever is quantization: int8 weights mean half the bytes loaded vs fp16, which can translate to as much as a 2x speedup on memory-bound layers, provided the kernel dequantizes on the fly without adding much overhead of its own. Libraries like bitsandbytes make weight-only quantization close to a one-liner for Hugging Face models.
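Here's the rough math behind that, as a sketch. It assumes a hypothetical 7-billion-parameter model decoding one token at a time, where every weight has to be read once per token, and it uses the ~3 TB/s bandwidth figure from earlier. Real kernels won't hit peak bandwidth, so treat these as floors rather than predictions.

```python
def min_ms_per_token(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if every weight is read from HBM once."""
    seconds = num_params * bytes_per_param / bandwidth_bytes_per_s
    return seconds * 1e3


NUM_PARAMS = 7e9   # hypothetical model size
BANDWIDTH = 3e12   # bytes/s, the figure quoted earlier

fp16_ms = min_ms_per_token(NUM_PARAMS, 2, BANDWIDTH)
int8_ms = min_ms_per_token(NUM_PARAMS, 1, BANDWIDTH)
print(f"fp16 floor: {fp16_ms:.2f} ms/token, int8 floor: {int8_ms:.2f} ms/token")
```

No amount of extra compute moves that floor; only loading fewer bytes does.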
For inference servers with KV caches, the cache itself often dwarfs the model. Look into paged attention (implemented in vLLM) and KV cache quantization. I haven't benchmarked KV quantization across enough model families to have a strong opinion on quality tradeoffs, but the throughput wins are real on memory-bound serving.
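A quick way to see how large the cache gets is to just multiply it out. The shape below (32 layers, 32 KV heads of dimension 128, 4096 tokens, batch of 16) is a hypothetical configuration I chose for illustration, not a measurement of any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Total KV cache size: one K and one V tensor per layer, per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 2 ** 30
fp16_cache = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=16)
int8_cache = kv_cache_bytes(32, 32, 128, 4096, 16, bytes_per_elem=1)
print(f"fp16 KV cache: {fp16_cache / GIB:.0f} GiB, int8 KV cache: {int8_cache / GIB:.0f} GiB")
```

Every decode step has to read the cache for the sequences it is extending, so halving its size halves that traffic too.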
Step 4: Stop the silent allocations
This one is sneaky. Look for patterns like this:
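
    # Allocates new tensors every call
    def forward_bad(self, x):
        mask = torch.zeros_like(x)
        mask[x > 0] = 1.0
        return x * mask

    # Reuses a preallocated buffer, in-place where safe
    def forward_good(self, x, mask_buf):
        # mask_buf: preallocated tensor with x's shape, dtype, and device
        torch.gt(x, 0, out=mask_buf)
        return x.mul_(mask_buf)  # in-place mul: overwrites the caller's x

The first version allocates a mask, builds a boolean index, writes through it, and then multiplies into yet another new tensor: roughly four passes over memory and three allocations. The second is two passes and no allocations. Multiply that across every layer and the first version has doubled your memory traffic for no reason.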
Prevention: budget memory traffic, not just FLOPs
The mental shift that helped me most: when designing or reviewing an ML pipeline, estimate the bytes moved, not just the FLOPs computed. For each op, ask 'how many FLOPs per byte?' If it's below your hardware's compute-to-bandwidth ratio, that op is memory-bound and no amount of compute optimization will save it.
A few habits that pay off:
- Profile before optimizing. The first time I 'optimized' a transformer I spent two days rewriting matmuls that weren't the bottleneck.
- Prefer fused implementations of attention and normalization layers — they exist for a reason.
- Bucket dynamic shapes for compiled models.
- Watch your activation memory during training; gradient checkpointing trades compute for memory and that's often the right trade now.
- Read Horace He's 'Making Deep Learning Go Brrrr From First Principles'. It's the clearest explanation of this whole topic I've found.
The annoying truth is that for most of us, ML performance work in 2026 is mostly memory work. Compute is cheap, bandwidth is precious, and the gap is only getting wider. Plan accordingly.
