The frustration is familiar. You finally pull down a 27B-parameter model, fire up your loader, and watch CUDA throw an out-of-memory error before the prompt even renders. Your 24GB card, the same one that handled 13B models without breaking a sweat, now refuses to cooperate.
I hit this wall again last month while testing some of the newer Qwen 27B-class checkpoints on a single 3090. Spent the better part of a weekend benchmarking llama.cpp, ik_llama.cpp, and vLLM trying to find a sane combination of quantization and runtime flags. Here's what actually works, and more importantly, why the obvious settings fail.
Why 27B "should" fit but doesn't
Quick math: a 27B model at FP16 is roughly 54GB. At INT8 it's around 27GB. At 4-bit it lands somewhere between 14 and 16GB depending on the quant scheme. So on paper, a 4-bit quant should leave you with 8 to 10GB of headroom on a 24GB card.
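To make that arithmetic explicit, here is a rough sketch. The bits-per-weight figures for the two 4-bit rows are approximate averages I'm assuming for illustration; real files vary with the quant mix and metadata.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB, treating 1 GB as 1e9 bytes."""
    return params_billion * bits_per_weight / 8


# Bits-per-weight for the 4-bit rows are assumed, approximate averages.
rows = [("FP16", 16), ("INT8", 8), ("4-bit (~4.25 bpw)", 4.25), ("4-bit (~4.85 bpw)", 4.85)]
for label, bits in rows:
    print(f"{label:>18}: {weight_gb(27, bits):5.1f} GB")
```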
That headroom is a lie. Three things eat it:
- KV cache: grows linearly with context length. At 8K tokens on a 27B model, you're looking at 2-4GB depending on grouped-query attention settings.
- Activation memory: transient but spiky, especially during prefill.
- Runtime bookkeeping: vLLM in particular reserves a chunk for its paged attention pool.
So the question isn't "does the model fit." It's "does the model plus your working context plus your runtime's bookkeeping fit."
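The KV cache term can be estimated with the usual formula: two tensors (keys and values) per layer, each sized KV heads × head dimension × context length. A minimal sketch follows. The layer count, head counts, and head dimension are hypothetical shapes picked for illustration, not the configuration of any specific checkpoint; read the real values from the model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in bytes: keys + values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical 64-layer model with head_dim 128, FP16 cache, 8K context.
for n_kv_heads in (8, 16):
    size = kv_cache_bytes(n_layers=64, n_kv_heads=n_kv_heads, head_dim=128, n_ctx=8192)
    print(f"{n_kv_heads} KV heads: {size / GIB:.1f} GiB")
# 8 KV heads: 2.0 GiB
# 16 KV heads: 4.0 GiB
```

Doubling the context doubles the result, and fewer KV heads (more aggressive grouped-query attention) shrinks it proportionally.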
Picking the right quant
Not all 4-bit quants are equal. Here's the practical hierarchy I've settled on for 24GB cards.
Q4_K_M vs IQ4_XS vs Q5_K_M
For GGUF specifically:
- Q4_K_M: solid default. Around 16GB for a 27B model. Quality holds up for most tasks.
- IQ4_XS: smaller (~14GB), uses importance-matrix calibration. Slightly slower to evaluate but frees up room for larger context.
- Q5_K_M: noticeably better quality but pushes ~19GB, which leaves you fighting for KV cache budget.
I ran the same 50-prompt eval set across all three on a 27B model and IQ4_XS lost about 2% on reasoning compared to Q5_K_M. For chat workloads that gap basically disappears. For code generation, it widens.
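To see how quant size trades against context, here is a back-of-the-envelope sketch using the file sizes quoted above. The 1.5GB reserve for activations and runtime overhead and the 256 KiB-per-token KV cost are assumptions for illustration (the latter corresponds to a hypothetical 64-layer model with 8 KV heads of dimension 128 and an FP16 cache); substitute numbers measured on your own setup.

```python
CARD_GB = 24.0
QUANT_GB = {"IQ4_XS": 14.0, "Q4_K_M": 16.0, "Q5_K_M": 19.0}  # sizes quoted in the text


def headroom_gb(weights_gb: float, card_gb: float = CARD_GB, reserve_gb: float = 1.5) -> float:
    """VRAM left for the KV cache after weights and an assumed fixed reserve."""
    return card_gb - weights_gb - reserve_gb


def max_context_tokens(headroom: float, kv_bytes_per_token: int = 256 * 1024) -> int:
    """Tokens of FP16 KV cache that fit in `headroom` GB (1 GB = 1e9 bytes)."""
    return int(headroom * 1e9 // kv_bytes_per_token)


for name, size in QUANT_GB.items():
    left = headroom_gb(size)
    print(f"{name}: {left:.1f} GB left, roughly {max_context_tokens(left)} tokens of FP16 KV cache")
```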
Backend showdown
This is where the weekend went. Same model, same quant, several runtimes.
llama.cpp
The reliable baseline. CPU offload works, partial GPU layers work, KV cache quantization works. Build with CUDA support:
# Build with CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
Then run with enough layers offloaded to fill VRAM:
# -ngl 99: offload all layers we can; -c 8192: context size
# -ctk/-ctv q8_0: quantize KV cache to 8-bit
./build/bin/llama-server \
  -m qwen-27b-iq4xs.gguf \
  -ngl 99 \
  -c 8192 \
  -ctk q8_0 -ctv q8_0 \
  --flash-attn
The -ctk and -ctv flags are the unsung heroes. Quantizing the KV cache to Q8_0 roughly halves cache memory with negligible quality loss. Skip this and you'll hit OOM around 4K context. With it, 16K is comfortable on a 24GB card. Docs at github.com/ggml-org/llama.cpp.
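The "roughly halves" figure follows from the q8_0 layout: as I understand the GGML format, each block stores 32 values as 8-bit integers plus one 16-bit scale, so 34 bytes per 32 elements versus 64 bytes at FP16. A quick sketch of the arithmetic, using hypothetical model shapes (64 layers, 8 KV heads, head dimension 128) that are assumptions rather than any real checkpoint's config:

```python
F16_BYTES_PER_ELEM = 2.0
Q8_0_BYTES_PER_ELEM = (32 * 1 + 2) / 32  # 32 int8 values + one fp16 scale per block


def kv_cache_gib(n_ctx: int, bytes_per_elem: float,
                 n_layers: int = 64, n_kv_heads: int = 8, head_dim: int = 128) -> float:
    """KV cache size in GiB for the assumed (hypothetical) model shape."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 1024 ** 3


print(f"q8_0 / f16 ratio: {Q8_0_BYTES_PER_ELEM / F16_BYTES_PER_ELEM:.3f}")
for n_ctx in (8192, 16384):
    f16 = kv_cache_gib(n_ctx, F16_BYTES_PER_ELEM)
    q8 = kv_cache_gib(n_ctx, Q8_0_BYTES_PER_ELEM)
    print(f"ctx {n_ctx}: f16 {f16:.2f} GiB, q8_0 {q8:.2f} GiB")
```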
ik_llama.cpp
A fork that's been pushing more aggressive quantization formats and CPU-side optimizations. The IQK-family quants it ships are sharper than vanilla IQ4_XS at similar sizes. For pure GPU inference the speedup over upstream llama.cpp is modest, maybe 10-15% on my tests. For mixed CPU/GPU it's larger.
I haven't tested every quant format thoroughly, but the IQ4_KS variant gave me the best quality-per-byte on a 27B model at the cost of slightly more complex setup. Worth checking the project's repo for current quant recommendations since this space moves fast.
vLLM
vLLM is the speed king for batched inference, but on a single 24GB card with a 27B model you fight for every megabyte. The catch: vLLM doesn't load GGUF the same way llama.cpp does. You either need an AWQ or GPTQ quant of the model, or accept the FP16 weights won't fit at all.
A config that actually loads:
# Using AWQ 4-bit weights
python -m vllm.entrypoints.openai.api_server \
--model qwen-27b-awq \
--quantization awq \
--gpu-memory-utilization 0.92 \
--max-model-len 8192 \
--kv-cache-dtype fp8
The --gpu-memory-utilization flag is critical. Default 0.9 sometimes works, sometimes OOMs depending on your driver state. I drop it to 0.88 if I want to keep a browser open on the same GPU. See docs.vllm.ai for the full flag list.
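As a rough mental model, assuming vLLM treats --gpu-memory-utilization as the fraction of total VRAM it may use for weights, activations, and the KV cache pool combined, the flag translates into a cache budget like this. The 15GB weight size and 1.5GB activation peak are placeholder assumptions, not measurements:

```python
def kv_pool_gb(total_gb: float, utilization: float,
               weights_gb: float, activation_peak_gb: float) -> float:
    """Estimated VRAM left for the paged KV cache under a utilization cap."""
    return total_gb * utilization - weights_gb - activation_peak_gb


for util in (0.88, 0.90, 0.92):
    pool = kv_pool_gb(total_gb=24.0, utilization=util,
                      weights_gb=15.0, activation_peak_gb=1.5)
    outside = 24.0 * (1 - util)
    print(f"utilization {util:.2f}: ~{pool:.2f} GB for KV cache, {outside:.2f} GB left outside vLLM")
```

Each 0.01 step moves about 0.24GB between vLLM and everything else on the card, which is why a browser or desktop compositor can tip a marginal config over.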
For single-user latency, vLLM is fast but not dramatically faster than llama.cpp with flash attention. Where it pulls ahead is concurrent requests — paged attention shines with batching.
The order of operations that actually works
After enough trial and error, this is the checklist I run whenever I'm fitting a new 27B-class model on 24GB:
- Enable flash attention (--flash-attn in llama.cpp).
- Watch nvidia-smi during prefill, not just generation. Activation spikes are real.
Only after that baseline works do I start swapping in larger quants or higher context windows.
What I'd avoid
A few things that wasted my time so you don't have to:
- Don't trust the model's advertised context length on a single 24GB card. A checkpoint trained at 128K won't fit 128K of KV cache. Cap it.
- Don't load FP16 with --load-format auto in vLLM. It'll try and crash. Be explicit about quantization.
- Don't mix CPU offload with batched serving. The latency variance is brutal.
- Don't benchmark with a cold cache. Run a warmup pass before measuring tokens/sec.
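The warmup point is easy to bake into a small harness. In this sketch, `generate` is a hypothetical callable you supply that sends one prompt to your backend and returns the number of tokens produced; nothing here is tied to a specific runtime's API:

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str], int], prompt: str,
                      warmup: int = 1, runs: int = 3) -> float:
    """Average generation throughput, excluding warmup passes from the timing."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    for _ in range(warmup):
        generate(prompt)  # result discarded: this only warms caches and kernels
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

Averaging over several timed runs also smooths out the run-to-run variance you get from a single measurement.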
Prevention: monitor before you launch
The single most useful habit I picked up: keep a terminal with watch -n 0.5 nvidia-smi open while I'm tuning. You'll see the moment a config is about to OOM versus when it's settling. The temptation is to keep cranking context length until it crashes — resist that and instead watch VRAM converge during a real prefill.
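If you want the peak as a number rather than eyeballing it, the same reading can be scripted. This sketch shells out to nvidia-smi's CSV query mode and tracks the highest memory.used value seen; the helper names are my own, and the reader function is injectable so the loop can be exercised without a GPU:

```python
import subprocess
import time
from typing import Callable, Tuple

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_line(line: str) -> Tuple[int, int]:
    """Parse one CSV row such as '21500, 24576' into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def read_gpu_memory(gpu_index: int = 0) -> Tuple[int, int]:
    """Query nvidia-smi once and return (used_mib, total_mib) for one GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.strip().splitlines()[gpu_index])


def peak_used_mib(samples: int, interval_s: float = 0.5,
                  read: Callable[[], Tuple[int, int]] = read_gpu_memory) -> int:
    """Poll `read` repeatedly and return the highest used-memory reading."""
    peak = 0
    for _ in range(samples):
        used, _total = read()
        peak = max(peak, used)
        time.sleep(interval_s)
    return peak
```

Calling peak_used_mib(samples=120) in a second terminal while the server chews through a long prompt covers a one-minute window at the default interval.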
Most of the OOMs I've seen aren't about the model being too big. They're about someone setting context length aspirationally and forgetting the KV cache scales with it. Once you internalize that the model weights are the floor and everything else stacks on top, the tuning gets much faster.
I haven't tested every backend in this space thoroughly, but the principles above hold across runtimes. Quant choice, KV cache quantization, and explicit context limits are the three knobs that matter most.
