
Why Your LLM Agent Runs Out of Memory Mid-Task and How to Fix It

Agentic AI workloads exhaust accelerator memory fast. Learn how to debug KV cache bloat and fix it with context compaction, cache quantization, and smarter agent design.

Alan West
Authon Team

You've built an AI agent that works beautifully on short tasks. It browses docs, writes code, runs tests. Then you hand it something complex — a multi-file refactor, a research task with dozens of sources — and it dies halfway through. Out of memory. Every time.

I spent two weeks debugging this exact problem across three different agent deployments last month. The root cause isn't what you'd expect, and the fix is more about architecture than throwing hardware at it.

The Real Problem: KV Cache Growth Is Unbounded, Your VRAM Is Not

When your agent runs a multi-step task, each step adds to the conversation context. The model needs to attend to all previous tokens, which means the key-value cache grows with every tool call, every observation, every chain-of-thought block.

Here's the math that kills you. For a typical transformer-based model:

python
# KV cache memory per token (approximate)
# 2 (key + value) * num_layers * hidden_dim * precision_bytes
# For a 70B-class model with 80 layers, 8192 hidden dim, fp16:
memory_per_token = 2 * 80 * 8192 * 2  # bytes
memory_per_token_mb = memory_per_token / (1024 * 1024)
print(f"Per token: {memory_per_token_mb:.2f} MB")
# Per token: ~2.5 MB

# At 32K context, that's ~80 GB just for the KV cache
# At 128K context, you're looking at ~320 GB
# Your agent hit 64K tokens on step 12 of 30. Good luck.

The model weights themselves are already eating most of your accelerator memory. The KV cache is fighting for whatever's left. And agentic workflows are uniquely bad at this because they generate way more tokens than a single prompt-response cycle.

Step 1: Instrument Your Agent's Memory Usage

Before you fix anything, measure. I was shocked at how much context my agents were actually accumulating.

python
import tiktoken

class AgentMemoryTracker:
    def __init__(self, encoding_name="cl100k_base"):
        # cl100k_base is a tiktoken encoding name, not a model name
        self.encoder = tiktoken.get_encoding(encoding_name)
        self.step_tokens = []
    
    def track_step(self, messages: list[dict]):
        """Call this after each agent step to log context size."""
        total_tokens = sum(
            len(self.encoder.encode(m["content"]))
            for m in messages
            if m.get("content")
        )
        self.step_tokens.append(total_tokens)
        
        # Warn if we're approaching danger zone
        if total_tokens > 30_000:
            growth_rate = (
                total_tokens / self.step_tokens[0]
                if self.step_tokens[0] > 0 else 0
            )
            print(
                f"WARNING: Context at {total_tokens} tokens "
                f"({growth_rate:.1f}x initial). "
                f"Consider compaction."
            )
        return total_tokens

When I added this to my pipeline, I discovered that tool outputs were the primary offender. A single web scrape or file read could dump 4,000+ tokens into context. After ten tool calls, we'd burned through half our context window on observations the model barely referenced again.

Step 2: Implement Sliding Window Context Compaction

The fix that made the biggest difference was aggressive context compaction. The idea is simple: your agent doesn't need the full raw output of every previous step. It needs a summary of what it learned.

python
def truncate_with_summary(text, max_chars):
    """Keep the head of a long output and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f"\n[... {omitted} chars truncated ...]"

def compact_agent_context(messages):
    """
    Keep the system prompt and recent messages intact.
    Summarize older tool outputs aggressively.
    """
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    
    # Always keep the last N exchanges in full detail
    KEEP_RECENT = 6  # 3 assistant + 3 user/tool messages
    recent = non_system[-KEEP_RECENT:]
    older = non_system[:-KEEP_RECENT]
    
    if not older:
        return messages
    
    # Collapse older tool results into compact summaries
    compacted = []
    for msg in older:
        if msg["role"] == "tool" and len(msg["content"]) > 500:
            # Replace verbose tool output with a brief note
            compacted.append({
                "role": msg["role"],
                "content": truncate_with_summary(msg["content"], 200),
                "tool_call_id": msg.get("tool_call_id"),
            })
        else:
            compacted.append(msg)
    
    return system_msgs + compacted + recent

This alone cut our peak memory usage by about 40%. The agent still had access to recent detailed context, and older steps were summarized enough to maintain coherence.

Step 3: Batch Your KV Cache Intelligently

If you're serving multiple agent sessions, naive per-request KV caching will destroy you. The trick is prefix sharing — when multiple agents use the same system prompt and tool definitions, their KV caches share a common prefix.

Most inference frameworks (vLLM, TensorRT-LLM, SGLang) support some form of automatic prefix caching. But you have to structure your prompts to actually benefit from it:

  • Put your system prompt and tool schemas first and keep them identical across sessions
  • Put session-specific context after the shared prefix
  • Avoid randomizing tool order or injecting timestamps into the system prompt

This matters more than you'd think. With 50 concurrent agent sessions sharing an 800-token system prefix, you save roughly 40K tokens worth of KV cache memory. That's memory freed up for actual context.
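
To make the prompt-structuring point concrete, here's a minimal sketch of how a session's messages might be assembled so the shared prefix stays byte-identical across sessions. SYSTEM_PROMPT, TOOL_SCHEMAS, and build_session_messages are hypothetical names for illustration, not part of any framework's API:

python
# Shared prefix: byte-identical across every session, so the serving
# framework's automatic prefix caching (e.g. vLLM's --enable-prefix-caching)
# can reuse one cached copy of its KV entries.
SYSTEM_PROMPT = "You are a coding agent. Use the provided tools."  # no timestamps
TOOL_SCHEMAS = [  # fixed order and content, passed identically on every request
    {"name": "read_file", "parameters": {"path": "string"}},
    {"name": "run_tests", "parameters": {"target": "string"}},
]

def build_session_messages(session_context: str, user_msg: str) -> list[dict]:
    """Shared prefix first; session-specific content strictly after it."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Session context:\n{session_context}\n\n{user_msg}"},
    ]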

Step 4: Use Quantized KV Caches

This one's newer and surprisingly effective. Instead of storing your key-value cache in fp16, you can quantize it to fp8 or int8 (some stacks go as low as int4) with minimal quality loss.

In vLLM, you can enable this with:

bash
# Start vLLM with KV cache quantization
python -m vllm.entrypoints.openai.api_server \
    --model your-model-path \
    --kv-cache-dtype fp8_e5m2 \
    --max-model-len 131072

fp8 KV caches cut memory usage roughly in half compared to fp16, and in my benchmarks the output quality difference was negligible for agentic tasks. The agent still planned, used tools, and self-corrected just as well. Where you might notice degradation is on tasks requiring very precise numerical reasoning over extremely long contexts — but most agent workflows aren't doing that.

Step 5: Offload What You Don't Need Right Now

For really long-running agents (think: autonomous coding sessions that run for 30+ minutes), even compacted context gets large. The nuclear option is KV cache offloading — moving older cache layers to CPU RAM or even disk, and pulling them back when the attention mechanism needs them.

This introduces latency, obviously. But for agentic workflows where the bottleneck is often external (waiting for API calls, test suites, builds), the offloading overhead is masked by natural pauses in the pipeline. The agent is waiting for pytest to finish anyway — it doesn't care if a cache page fault added 50ms.
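
What this looks like in practice depends on your serving stack. As one concrete example, vLLM exposes a --swap-space flag that reserves CPU RAM for KV cache blocks belonging to preempted sequences. The values below are illustrative, not tuned recommendations, and this is swap-on-preemption rather than fully transparent per-layer offloading:

bash
# Reserve 16 GiB of CPU RAM per GPU as swap space for KV cache blocks
# that get evicted when GPU memory fills up
python -m vllm.entrypoints.openai.api_server \
    --model your-model-path \
    --swap-space 16 \
    --max-model-len 131072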

Prevention: Design Agents That Stay Lean

The best fix is not needing one. A few patterns I've adopted:

  • Structured tool outputs: Make your tools return JSON with only the fields the agent needs, not raw HTML or full file contents
  • Observation budgets: Set a hard cap on tool output tokens (I use 1,500 per tool call) and truncate with a note that the full output is available if needed (see the sketch after this list)
  • Hierarchical agents: Instead of one agent with a 200K context window, use a coordinator that delegates to sub-agents with fresh, small contexts. Each sub-agent handles one subtask and returns a summary
  • Checkpoint and resume: For very long tasks, serialize the agent's state summary to disk periodically. If it crashes, you restart from the checkpoint rather than replaying the full history
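
Here's a minimal sketch of the observation-budget idea: a wrapper that caps any tool's output at a fixed token count before it ever enters the agent's context. with_observation_budget is a hypothetical helper written for this post, and tiktoken's cl100k_base encoding is just one way to count tokens:

python
import tiktoken

_ENCODER = tiktoken.get_encoding("cl100k_base")

def with_observation_budget(tool_fn, max_tokens=1500):
    """Wrap a tool so its output never exceeds max_tokens in context."""
    def wrapped(*args, **kwargs):
        output = str(tool_fn(*args, **kwargs))
        tokens = _ENCODER.encode(output)
        if len(tokens) <= max_tokens:
            return output
        truncated = _ENCODER.decode(tokens[:max_tokens])
        return (
            truncated
            + f"\n[Output truncated at {max_tokens} tokens; "
            "full output available on request.]"
        )
    return wrapped

# Usage: read_file = with_observation_budget(read_file, max_tokens=1500)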

The Bigger Picture

Hardware will keep getting better. We're seeing accelerators with significantly more high-bandwidth memory, better interconnects for distributed inference, and chips specifically optimized for the long sequential token generation patterns that agentic workloads demand. That's great.

But hardware improvements don't excuse sloppy software architecture. Every generation of accelerator unlocks bigger models and longer contexts, and agentic workloads will expand to fill whatever memory you give them. The teams I've seen run agents reliably in production all have one thing in common: they treat context management as a first-class engineering problem, not an afterthought.

Get your memory instrumentation in place now. Your future self — and your future GPU bill — will thank you.
