You've built your app, integrated an LLM API, and everything works great for a few weeks. Then one morning, the outputs start feeling... off. Responses are shorter, less accurate, or just plain wrong. Your token usage is through the roof, your costs are climbing, and when you file a support ticket, you get a canned response that helps nobody.
I've been there. Multiple times. After debugging these issues across several projects, I've found that most LLM integration problems fall into three buckets: token mismanagement, prompt drift, and a lack of observability. Let's fix all three.
The Token Problem Nobody Talks About
Most developers think about tokens as a billing concern. They're not — they're an architecture concern. When your context window fills up, the model doesn't just cost more. It performs worse.
Here's what typically happens: you start with a clean prompt, the model nails it, and over time you keep appending instructions, few-shot examples, and system messages. Before you know it, you're sending 80k tokens per request and the model is drowning in context.
# Bad: stuffing everything into one massive prompt
def get_response(user_input, conversation_history, system_prompt, examples):
    messages = [
        {"role": "system", "content": system_prompt},  # 2k tokens
        *examples,              # 15k tokens of few-shot examples
        *conversation_history,  # grows unbounded
        {"role": "user", "content": user_input},
    ]
    return client.chat(messages=messages)

The fix starts with actually measuring your token usage per request. Not just for billing — for performance.
import tiktoken

def count_tokens(messages, model="gpt-4"):
    """Count tokens BEFORE sending to the API."""
    encoding = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += len(encoding.encode(msg["content"]))
        total += 4  # overhead per message (role, separators)
    return total
def trim_conversation(messages, max_tokens=4000):
    """Keep recent messages within a token budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    trimmed = []
    token_count = count_tokens(system)
    # Walk backwards from most recent, keep what fits
    for msg in reversed(others):
        msg_tokens = count_tokens([msg])
        if token_count + msg_tokens > max_tokens:
            break
        trimmed.insert(0, msg)
        token_count += msg_tokens
    return system + trimmed

This alone fixed a quality regression in a project I worked on last month. We'd been sending 60k tokens when the useful context was maybe 8k.
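Wiring the two together is straightforward. Here's a sketch of the earlier get_response rewritten to enforce a budget before sending; the 4,000-token budget and the client object are placeholders:

def get_response(user_input, conversation_history, system_prompt, examples):
    messages = [
        {"role": "system", "content": system_prompt},
        *examples,
        *conversation_history,
        {"role": "user", "content": user_input},
    ]
    # Enforce a hard input budget before the request goes out
    if count_tokens(messages) > 4000:
        messages = trim_conversation(messages, max_tokens=4000)
    return client.chat(messages=messages)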
Why Output Quality Degrades Over Time
This one is sneaky. You ship a feature, it works, and three weeks later users complain the AI responses are worse. You haven't changed your code. What happened?
A few common culprits:
- Model version changes. Providers update models without changing the API endpoint. Your gpt-4 from January isn't the same gpt-4 from April. Pin your model versions explicitly (see the sketch after this list).
- Prompt sensitivity. Small changes to your system prompt — even reordering sentences — can produce wildly different outputs. Treat prompts like code: version them, test them, review them.
- Context pollution. If you're feeding conversation history or retrieval results into the context, garbage data accumulates over time.
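Pinning and prompt versioning are each a one-line habit. A minimal sketch, where the dated model ID and version string are illustrative placeholders and should come from your provider's actual model list:

# Pin an exact dated snapshot, never the floating alias
MODEL = "gpt-4-0613"           # illustrative; check your provider's model list
PROMPT_VERSION = "2024-04-02"  # bump whenever SYSTEM_PROMPT changes

SYSTEM_PROMPT = "You are a concise log-analysis assistant."

def pinned_chat(client, user_input):
    return client.chat(
        model=MODEL,  # never "gpt-4" or "latest"
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
    )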
The defensive fix is to build an evaluation pipeline. It doesn't have to be fancy.
import json
from datetime import datetime

# Define test cases with expected behaviors (not exact matches)
TEST_CASES = [
    {
        "input": "Summarize this error log: KeyError at line 42 in auth.py",
        "must_contain": ["KeyError", "auth.py"],
        "must_not_contain": ["I'm sorry", "I cannot"],
        "max_words": 200,  # response shouldn't be a novel
    },
    {
        "input": "Convert 45°C to Fahrenheit",
        "must_contain": ["113"],
        "must_not_contain": [],
        "max_words": 100,
    },
]
def run_eval(client, model, test_cases):
    results = []
    for case in test_cases:
        response = client.chat(
            model=model,
            messages=[{"role": "user", "content": case["input"]}]
        )
        text = response.content
        passed = all(term in text for term in case["must_contain"])
        passed &= not any(term in text for term in case["must_not_contain"])
        # Word count is a cheap proxy for length; swap in a real token
        # count if you need precision
        passed &= len(text.split()) <= case["max_words"]
        results.append({
            "input": case["input"],
            "passed": passed,
            "response_length": len(text.split()),
            "timestamp": datetime.utcnow().isoformat(),
        })
    return results

Run this daily. When scores drop, you know immediately whether it's your code or the model that changed. I keep a simple JSON log of results and graph them weekly — nothing fancy, just enough to catch regressions before users do.
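The logging side is equally small. A sketch of that append-only log, with the file path as an arbitrary choice:

import json
from datetime import datetime

def log_eval_run(results, path="eval_results.jsonl"):
    # One JSON line per run: pass rate up front, raw results for digging
    record = {
        "run_at": datetime.utcnow().isoformat(),
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "results": results,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")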
Building Observability You'll Actually Use
The biggest mistake I see in LLM integrations is treating the API like a black box. You send a request, get a response, done. No logging, no metrics, no way to debug when things go sideways.
At minimum, log these for every API call:
- Input token count and output token count
- Latency (time to first token and total)
- Model version (the exact version string the API reports back, not the one you requested)
- Temperature and other parameters used
- A hash of the prompt template so you can correlate quality changes with prompt changes
import time
import json
import hashlib
import logging

logger = logging.getLogger("llm_observability")

def observed_chat(client, messages, **kwargs):
    """Wrapper that adds observability to any LLM call."""
    prompt_hash = hashlib.md5(
        json.dumps(messages, sort_keys=True).encode()
    ).hexdigest()[:8]
    start = time.monotonic()
    response = client.chat(messages=messages, **kwargs)
    latency = time.monotonic() - start
    logger.info(
        "llm_call",
        extra={
            "prompt_hash": prompt_hash,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "latency_seconds": round(latency, 3),
            "model": response.model,  # actual model used, not requested
            "temperature": kwargs.get("temperature", "default"),
        },
    )
    return response

Once you have this data flowing, patterns jump out. You'll notice that your Monday morning latency is 3x worse (cold caches), that one prompt template accounts for 70% of your token spend, or that the model version quietly changed last Thursday — right when the complaints started.
When to Walk Away From a Provider
Sometimes the problem isn't your code. If you're consistently hitting rate limits that don't match your plan, getting degraded responses during peak hours, or waiting weeks for support to acknowledge a real issue, it might be time to abstract your integration.
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def chat(self, messages, **kwargs):
        pass

class OpenAIProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        # OpenAI-specific implementation
        pass

class AnthropicProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        # Anthropic-specific implementation
        pass

class OllamaProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        # Local model via Ollama — your escape hatch
        pass

This isn't over-engineering if you've already been burned by a provider change. The abstraction layer takes an afternoon to build and saves you weeks of emergency migration later. I now start every LLM project with this pattern after learning the hard way.
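With that in place, switching backends becomes a configuration change rather than a rewrite. A minimal sketch, assuming a hypothetical LLM_PROVIDER environment variable:

import os

def get_provider() -> LLMProvider:
    # One env var decides which backend the whole app talks to
    providers = {
        "openai": OpenAIProvider,
        "anthropic": AnthropicProvider,
        "ollama": OllamaProvider,
    }
    name = os.environ.get("LLM_PROVIDER", "openai")
    return providers[name]()

client = get_provider()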
Prevention Checklist
Before you ship an LLM-powered feature, make sure you have:
- Token budgets — hard limits on input context size, enforced in code
- Model version pinning — use date-stamped model IDs, not aliases
- Automated evals — even five test cases run daily will catch most regressions
- Structured logging — tokens, latency, model version on every call
- A provider abstraction — even a thin one, so you're not locked in
- Fallback behavior — what does your app do when the API returns garbage or times out? (see the sketch below)
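That last item is the one most often skipped, so here's one minimal shape for it: a retry wrapper with a canned degradation path. The exception handling is deliberately broad and should be narrowed to whatever your client library actually raises:

FALLBACK_MESSAGE = "Sorry, that feature is temporarily unavailable. Please try again."

def chat_with_fallback(client, messages, retries=2, **kwargs):
    # Retry transient failures, then degrade gracefully instead of crashing
    for attempt in range(retries + 1):
        try:
            response = client.chat(messages=messages, **kwargs)
            if response.content and response.content.strip():
                return response.content
        except Exception:  # narrow to your client's timeout/API error types
            pass
    return FALLBACK_MESSAGE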
None of this is glamorous. Nobody's writing blog posts titled "I Added Logging to My LLM Integration and It Changed My Life." But it's the difference between an AI feature that works reliably in production and one that slowly degrades until someone on Hacker News writes a post about cancelling the service.
The LLM space moves fast. Models change, pricing shifts, quality fluctuates. The developers who build resilient integrations — with proper observability, testing, and abstraction — are the ones who don't wake up to a broken product on a random Tuesday morning.
