You've built your app, integrated an LLM API, and everything works great for a few weeks. Then one morning, the outputs start feeling... off. Responses are shorter, less accurate, or just plain wrong. Your token usage is through the roof, your costs are climbing, and when you file a support ticket, you get a canned response that helps nobody.
I've been there. Multiple times. After debugging these issues across several projects, I've found that most LLM integration problems fall into three buckets: token mismanagement, prompt drift, and a lack of observability. Let's fix all three.
The Token Problem Nobody Talks About
Most developers think about tokens as a billing concern. They're not — they're an architecture concern. When your context window fills up, the model doesn't just cost more. It performs worse.
Here's what typically happens: you start with a clean prompt, the model nails it, and over time you keep appending instructions, few-shot examples, and system messages. Before you know it, you're sending 80k tokens per request and the model is drowning in context.
# Bad: stuffing everything into one massive prompt
def get_response(user_input, conversation_history, system_prompt, examples):
    messages = [
        {"role": "system", "content": system_prompt},  # 2k tokens
        *examples,              # 15k tokens of few-shot examples
        *conversation_history,  # grows unbounded
        {"role": "user", "content": user_input},
    ]
    return client.chat(messages=messages)

The fix starts with actually measuring your token usage per request. Not just for billing — for performance.
import tiktoken

def count_tokens(messages, model="gpt-4"):
    """Count tokens BEFORE sending to the API."""
    encoding = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += len(encoding.encode(msg["content"]))
        total += 4  # overhead per message (role, separators)
    return total
def trim_conversation(messages, max_tokens=4000):
    """Keep recent messages within a token budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    trimmed = []
    token_count = count_tokens(system)
    # Walk backwards from most recent, keep what fits
    for msg in reversed(others):
        msg_tokens = count_tokens([msg])
        if token_count + msg_tokens > max_tokens:
            break
        trimmed.insert(0, msg)
        token_count += msg_tokens
    return system + trimmed

This alone fixed a quality regression in a project I worked on last month. We'd been sending 60k tokens when the useful context was maybe 8k.
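Wiring the two together is straightforward. Here's a sketch of the earlier get_response rewritten to enforce a budget before sending; the 4,000-token budget and the client object are placeholders:

def get_response(user_input, conversation_history, system_prompt, examples):
    messages = [
        {"role": "system", "content": system_prompt},
        *examples,
        *conversation_history,
        {"role": "user", "content": user_input},
    ]
    # Enforce a hard input budget before the request goes out
    if count_tokens(messages) > 4000:
        messages = trim_conversation(messages, max_tokens=4000)
    return client.chat(messages=messages)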
Why Output Quality Degrades Over Time
This one is sneaky. You ship a feature, it works, and three weeks later users complain the AI responses are worse. You haven't changed your code. What happened?
A few common culprits:
- Model version changes. Providers update models without changing the API endpoint. Your gpt-4 from January isn't the same gpt-4 from April. Pin your model versions explicitly (see the sketch after this list).
- Prompt sensitivity. Small changes to your system prompt — even reordering sentences — can produce wildly different outputs. Treat prompts like code: version them, test them, review them.
- Context pollution. If you're feeding conversation history or retrieval results into the context, garbage data accumulates over time.
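Pinning and prompt versioning are each a one-line habit. A minimal sketch, where the dated model ID and version string are illustrative placeholders and should come from your provider's actual model list:

# Pin an exact dated snapshot, never the floating alias
MODEL = "gpt-4-0613"           # illustrative; check your provider's model list
PROMPT_VERSION = "2024-04-02"  # bump whenever SYSTEM_PROMPT changes

SYSTEM_PROMPT = "You are a concise log-analysis assistant."

def pinned_chat(client, user_input):
    return client.chat(
        model=MODEL,  # never "gpt-4" or "latest"
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
    )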
The defensive fix is to build an evaluation pipeline. It doesn't have to be fancy.
import json
from datetime import datetime

# Define test cases with expected behaviors (not exact matches)
TEST_CASES = [
    {
        "input": "Summarize this error log: KeyError at line 42 in auth.py",
        "must_contain": ["KeyError", "auth.py"],
        "must_not_contain": ["I'm sorry", "I cannot"],
        "max_words": 200,  # response shouldn't be a novel
    },
    {
        "input": "Convert 45°C to Fahrenheit",
        "must_contain": ["113"],
        "must_not_contain": [],
        "max_words": 100,
    },
]
def run_eval(client, model, test_cases):
    results = []
    for case in test_cases:
        response = client.chat(
            model=model,
            messages=[{"role": "user", "content": case["input"]}]
        )
        text = response.content
        passed = all(term in text for term in case["must_contain"])
        passed &= not any(term in text for term in case["must_not_contain"])
        # Word count is a cheap proxy for length; swap in a real token
        # count if you need precision
        passed &= len(text.split()) <= case["max_words"]
        results.append({
            "input": case["input"],
            "passed": passed,
            "response_length": len(text.split()),
            "timestamp": datetime.utcnow().isoformat(),
        })
    return results

Run this daily. When scores drop, you know immediately whether it's your code or the model that changed. I keep a simple JSON log of results and graph them weekly — nothing fancy, just enough to catch regressions before users do.
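The logging side is equally small. A sketch of that append-only log, with the file path as an arbitrary choice:

import json
from datetime import datetime

def log_eval_run(results, path="eval_results.jsonl"):
    # One JSON line per run: pass rate up front, raw results for digging
    record = {
        "run_at": datetime.utcnow().isoformat(),
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "results": results,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")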
Building Observability You'll Actually Use
The biggest mistake I see in LLM integrations is treating the API like a black box. You send a request, get a response, done. No logging, no metrics, no way to debug when things go sideways.
At minimum, log these for every API call:
- Input token count and output token count
- Latency (time to first token and total)
- Model version (the exact version string the API reports back, not the one you requested)
- Temperature and other parameters used
- A hash of the prompt template so you can correlate quality changes with prompt changes
import time
import json
import hashlib
import logging

logger = logging.getLogger("llm_observability")

def observed_chat(client, messages, **kwargs):
    """Wrapper that adds observability to any LLM call."""
    prompt_hash = hashlib.md5(
        json.dumps(messages, sort_keys=True).encode()
    ).hexdigest()[:8]
    start = time.monotonic()
    response = client.chat(messages=messages, **kwargs)
    latency = time.monotonic() - start
    logger.info(
        "llm_call",
        extra={
            "prompt_hash": prompt_hash,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "latency_seconds": round(latency, 3),
            "model": response.model,  # actual model used, not requested
            "temperature": kwargs.get("temperature", "default"),
        },
    )
    return response

Once you have this data flowing, patterns jump out. You'll notice that your Monday morning latency is 3x worse (cold caches), that one prompt template accounts for 70% of your token spend, or that the model version quietly changed last Thursday — right when the complaints started.
When to Walk Away From a Provider
Sometimes the problem isn't your code. If you're consistently hitting rate limits that don't match your plan, getting degraded responses during peak hours, or waiting weeks for support to acknowledge a real issue, it might be time to abstract your integration.
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def chat(self, messages, **kwargs):
        pass

class OpenAIProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        # OpenAI-specific implementation
        pass

class AnthropicProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        # Anthropic-specific implementation
        pass

class OllamaProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        # Local model via Ollama — your escape hatch
        pass

This isn't over-engineering if you've already been burned by a provider change. The abstraction layer takes an afternoon to build and saves you weeks of emergency migration later. I now start every LLM project with this pattern after learning the hard way.
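With that in place, switching backends becomes a configuration change rather than a rewrite. A minimal sketch, assuming a hypothetical LLM_PROVIDER environment variable:

import os

def get_provider() -> LLMProvider:
    # One env var decides which backend the whole app talks to
    providers = {
        "openai": OpenAIProvider,
        "anthropic": AnthropicProvider,
        "ollama": OllamaProvider,
    }
    name = os.environ.get("LLM_PROVIDER", "openai")
    return providers[name]()

client = get_provider()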
Prevention Checklist
Before you ship an LLM-powered feature, make sure you have:
- Token budgets — hard limits on input context size, enforced in code
- Model version pinning — use date-stamped model IDs, not aliases
- Automated evals — even five test cases run daily will catch most regressions
- Structured logging — tokens, latency, model version on every call
- A provider abstraction — even a thin one, so you're not locked in
- Fallback behavior — what does your app do when the API returns garbage or times out? (see the sketch below)
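That last item is the one most often skipped, so here's one minimal shape for it: a retry wrapper with a canned degradation path. The exception handling is deliberately broad and should be narrowed to whatever your client library actually raises:

FALLBACK_MESSAGE = "Sorry, that feature is temporarily unavailable. Please try again."

def chat_with_fallback(client, messages, retries=2, **kwargs):
    # Retry transient failures, then degrade gracefully instead of crashing
    for attempt in range(retries + 1):
        try:
            response = client.chat(messages=messages, **kwargs)
            if response.content and response.content.strip():
                return response.content
        except Exception:  # narrow to your client's timeout/API error types
            pass
    return FALLBACK_MESSAGE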
None of this is glamorous. Nobody's writing blog posts titled "I Added Logging to My LLM Integration and It Changed My Life." But it's the difference between an AI feature that works reliably in production and one that slowly degrades until someone on Hacker News writes a post about cancelling the service.
The LLM space moves fast. Models change, pricing shifts, quality fluctuates. The developers who build resilient integrations — with proper observability, testing, and abstraction — are the ones who don't wake up to a broken product on a random Tuesday morning.
