Authon Blog
debugging · 6 min read

Why your 27B model won't fit on 24GB VRAM (and how to actually fix it)

Why 4-bit 27B models still OOM on 24GB cards, and the quant + KV cache + backend settings that actually let them fit.

Alan West
Authon Team

The frustration is familiar. You finally pull down a 27B-parameter model, fire up your loader, and watch CUDA throw an out-of-memory error before the prompt even renders. Your 24GB card, the same one that handled 13B models without breaking a sweat, now refuses to cooperate.

I hit this wall again last month while testing some of the newer Qwen 27B-class checkpoints on a single 3090. Spent the better part of a weekend benchmarking llama.cpp, ik_llama.cpp, and vLLM trying to find a sane combination of quantization and runtime flags. Here's what actually works, and more importantly, why the obvious settings fail.

Why 27B "should" fit but doesn't

Quick math: a 27B model at FP16 is roughly 54GB. At INT8 it's around 27GB. At 4-bit it lands somewhere between 14 and 16GB depending on the quant scheme. So on paper, a 4-bit quant should leave you with 8 to 10GB of headroom on a 24GB card.
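The arithmetic is just parameter count times bits per weight, divided by eight. Here is a minimal sketch in shell. The `weights_gb` helper is hypothetical, and the 4.25 and 4.85 bits-per-weight values are my rough assumptions for IQ4_XS-like and Q4_K_M-like quants, not measured figures:

```shell
#!/usr/bin/env bash
# Estimate weight size in GB (1e9 bytes) from parameter count (in billions)
# and effective bits per weight. Ignores embeddings kept at higher precision.
weights_gb() {
  local params_billion=$1 bits_per_weight=$2
  awk -v p="$params_billion" -v b="$bits_per_weight" \
    'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 27 16     # FP16
weights_gb 27 8      # INT8
weights_gb 27 4.25   # assumed bpw for an IQ4_XS-like quant
weights_gb 27 4.85   # assumed bpw for a Q4_K_M-like quant
```

This prints 54.0, 27.0, 14.3, and 16.4, which lines up with the 14 to 16GB range for 4-bit quants. Real files differ a little because some tensors are stored at higher precision.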

That headroom is a lie. Three things eat it:

  • KV cache: grows linearly with context length. At 8K tokens on a 27B model, you're looking at 2-4GB depending on the model's grouped-query attention configuration (fewer KV heads means a smaller cache).
  • Activation memory: transient but spiky, especially during prefill.
  • Runtime bookkeeping: vLLM in particular reserves a chunk for its paged attention pool.

So the question isn't "does the model fit." It's "does the model plus your working context plus your runtime's bookkeeping fit."
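To put a number on the KV cache part, the usual estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. A rough sketch follows, assuming a hypothetical 27B-class shape of 64 layers and head dimension 128. Those numbers are illustrative and not taken from any specific model card, so read the real ones from your model's config:

```shell
#!/usr/bin/env bash
# KV cache size in bytes for a given (hypothetical) model shape and context.
kv_cache_bytes() {
  local layers=$1 kv_heads=$2 head_dim=$3 bytes_per_elem=$4 ctx=$5
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx ))
}

# Convert bytes to GiB with two decimals.
to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

# FP16 cache (2 bytes per element) at 8K context.
to_gib "$(kv_cache_bytes 64 8 128 2 8192)"    # 8 KV heads
to_gib "$(kv_cache_bytes 64 16 128 2 8192)"   # 16 KV heads
```

With these assumed shapes it prints 2.00 and 4.00, so doubling the KV head count doubles the cache. Add that to the weight size and your runtime's overhead before deciding a config "fits."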

Picking the right quant

Not all 4-bit quants are equal. Here's the practical hierarchy I've settled on for 24GB cards.

Q4_K_M vs IQ4_XS vs Q5_K_M

For GGUF specifically:

  • Q4_K_M: solid default. Around 16GB for a 27B model. Quality holds up for most tasks.
  • IQ4_XS: smaller (~14GB), uses importance-matrix calibration. Slightly slower to evaluate but frees up room for larger context.
  • Q5_K_M: noticeably better quality but pushes ~19GB, which leaves you fighting for KV cache budget.

I ran the same 50-prompt eval set across all three on a 27B model and IQ4_XS lost about 2% on reasoning compared to Q5_K_M. For chat workloads that gap basically disappears. For code generation, it widens.

Backend showdown

This is where the weekend went. Same model, same quant, several runtimes.

llama.cpp

The reliable baseline. CPU offload works, partial GPU layers work, KV cache quantization works. Build with CUDA support:

bash
# Build with CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Then run with enough layers offloaded to fill VRAM:

bash
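# -ngl 99: offload all layers we can
# -c 8192: context size
# -ctk/-ctv q8_0: quantize KV cache to 8-bit
# (comments stay up here: text after a trailing backslash breaks the line continuation)
./build/bin/llama-server \
  -m qwen-27b-iq4xs.gguf \
  -ngl 99 \
  -c 8192 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn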

The -ctk and -ctv flags are the unsung heroes. Quantizing the KV cache to Q8_0 roughly halves cache memory with negligible quality loss. Keep flash attention on when you use them, since llama.cpp requires it for a quantized V cache. Skip the cache quantization and you'll hit OOM around 4K context. With it, 16K is comfortable on a 24GB card. Docs at github.com/ggml-org/llama.cpp.
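Where "roughly halves" comes from: Q8_0 stores values in blocks of 32 one-byte integers plus a 2-byte scale, so 34 bytes per 32 elements against 64 bytes for FP16. A small sketch of that ratio, where `cache_ratio_vs_f16` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Size of a KV cache type relative to f16, based on bytes per 32-element block.
cache_ratio_vs_f16() {
  local block_bytes
  case "$1" in
    f16)  block_bytes=64 ;;  # 32 elements * 2 bytes
    q8_0) block_bytes=34 ;;  # 32 int8 values + one 2-byte scale
    *) echo "unknown cache type: $1" >&2; return 1 ;;
  esac
  awk -v b="$block_bytes" 'BEGIN { printf "%.3f\n", b / 64 }'
}

cache_ratio_vs_f16 f16
cache_ratio_vs_f16 q8_0
```

This prints 1.000 and 0.531, so a cache that would take 4GB at FP16 takes a bit over 2GB at Q8_0.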

ik_llama.cpp

A fork that's been pushing more aggressive quantization formats and CPU-side optimizations. The IQK-family quants it ships are sharper than vanilla IQ4_XS at similar sizes. For pure GPU inference the speedup over upstream llama.cpp is modest, maybe 10-15% on my tests. For mixed CPU/GPU it's larger.

I haven't tested every quant format thoroughly, but the IQ4_KS variant gave me the best quality-per-byte on a 27B model at the cost of slightly more complex setup. Worth checking the project's repo for current quant recommendations since this space moves fast.

vLLM

vLLM is the speed king for batched inference, but on a single 24GB card with a 27B model you fight for every megabyte. The catch: vLLM doesn't load GGUF the same way llama.cpp does. In practice you need an AWQ or GPTQ quant of the model, because the FP16 weights won't fit at all.

A config that actually loads:

bash
# Using AWQ 4-bit weights
python -m vllm.entrypoints.openai.api_server \
  --model qwen-27b-awq \
  --quantization awq \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192 \
  --kv-cache-dtype fp8

The --gpu-memory-utilization flag is critical. Default 0.9 sometimes works, sometimes OOMs depending on your driver state. I drop it to 0.88 if I want to keep a browser open on the same GPU. See docs.vllm.ai for the full flag list.
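The flag is a fraction of total GPU memory that vLLM is allowed to claim, so small changes move a lot of memory on a 24GB card. A quick sketch of the arithmetic, with hypothetical helper names and the card treated as a flat 24GB:

```shell
#!/usr/bin/env bash
# Memory vLLM may claim for weights, activations, and KV cache.
vllm_budget_gb() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.2f\n", total * util }'
}

# Memory left over for everything else on the GPU (desktop, browser, etc.).
left_for_others_gb() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.2f\n", total * (1 - util) }'
}

vllm_budget_gb 24 0.92       # 22.08
left_for_others_gb 24 0.92   # 1.92
left_for_others_gb 24 0.88   # 2.88
```

Going from 0.92 to 0.88 hands back about 1GB to other processes, and that same 1GB comes out of what vLLM can spend on KV cache.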

For single-user latency, vLLM is fast but not dramatically faster than llama.cpp with flash attention. Where it pulls ahead is concurrent requests — paged attention shines with batching.

The order of operations that actually works

After enough trial and error, this is the checklist I run whenever I'm fitting a new 27B-class model on 24GB:

  • Start with a Q4_K_M or IQ4_XS GGUF for sanity-check loading.
  • Enable flash attention (--flash-attn in llama.cpp).
  • Quantize the KV cache to Q8_0 or FP8 — non-negotiable above 4K context.
  • Set context length explicitly. Don't let it default to whatever the model card says; you'll OOM.
  • Watch nvidia-smi during prefill, not just generation. Activation spikes are real.
  • Only after that baseline works do I start swapping in larger quants or higher context windows.

What I'd avoid

A few things that wasted my time so you don't have to:

  • Don't trust the model's advertised context length on a single 24GB card. A checkpoint trained at 128K won't fit 128K of KV cache. Cap it.
  • Don't load FP16 with --load-format auto in vLLM. It'll try and crash. Be explicit about quantization.
  • Don't mix CPU offload with batched serving. The latency variance is brutal.
  • Don't benchmark with a cold cache. Run a warmup pass before measuring tokens/sec.
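For the first point, you can work backwards from a memory budget to a context cap instead of guessing. A rough sketch, again assuming a hypothetical shape of 64 layers, 8 KV heads, and head dimension 128 (illustrative values, so substitute your model's real config):

```shell
#!/usr/bin/env bash
# Largest context (in tokens) whose KV cache fits in a budget given in whole GiB.
max_context_tokens() {
  local budget_gib=$1 layers=$2 kv_heads=$3 head_dim=$4 bytes_per_elem=$5
  # Bytes of K and V stored per token across all layers.
  local per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
  echo $(( budget_gib * 1024 * 1024 * 1024 / per_token ))
}

# 6 GiB left for an FP16 KV cache (2 bytes per element).
max_context_tokens 6 64 8 128 2
```

With these assumptions it prints 24576. The same shape at 128K context would need 32 GiB for the cache alone, more than the whole card.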

Prevention: monitor before you launch

The single most useful habit I picked up: keep a terminal with watch -n 0.5 nvidia-smi open while I'm tuning. You'll see the moment a config is about to OOM versus when it's settling. The temptation is to keep cranking context length until it crashes. Resist that and instead watch VRAM converge during a real prefill.
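If you'd rather have a number than eyeball the screen, nvidia-smi can log memory use as CSV and you can pull the peak out afterwards. A small sketch, where `peak_used_mib` and the `vram.log` filename are hypothetical names of my own:

```shell
#!/usr/bin/env bash
# Reads "used, total" lines (MiB) on stdin and prints the peak used value.
peak_used_mib() {
  awk -F', *' '$1 + 0 > peak { peak = $1 + 0 } END { print peak + 0 }'
}

# Real usage (needs an NVIDIA GPU). Start the logger, run a full prefill,
# then stop the logger with Ctrl-C and read the peak:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits -l 1 | tee vram.log
#   peak_used_mib < vram.log

# Dry run with made-up sample lines:
printf '17000, 24576\n22310, 24576\n21890, 24576\n' | peak_used_mib
```

The dry run prints 22310. One-second sampling can miss very short activation spikes, so treat the peak as a lower bound and leave some margin.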

Most of the OOMs I've seen aren't about the model being too big. They're about someone setting context length aspirationally and forgetting the KV cache scales with it. Once you internalize that the model weights are the floor and everything else stacks on top, the tuning gets much faster.

I haven't tested every backend in this space thoroughly, but the principles above hold across runtimes. Quant choice, KV cache quantization, and explicit context limits are the three knobs that matter most.
