Authon Blog
debugging · 6 min read

Why your LLM integration breaks in production and how to fix it

Your LLM integration works in dev but falls over in production. Here's the root cause and a step-by-step fix with timeouts, retries, and schema validation.

Alan West
Authon Team

Last month I shipped an "AI feature" that worked beautifully in development. Within 12 hours of going live, our error dashboard looked like a Christmas tree. Timeouts, malformed JSON, hallucinated function names, the works.

The problem wasn't the model. The problem was that I'd treated a probabilistic API like a deterministic one, and production traffic doesn't care about my optimism.

This is the actual lesson behind the slogan that AI is a technology, not a product. When you wrap an LLM call in your code, you're not "adding a product feature." You're integrating a flaky, non-deterministic, latency-spiking, occasionally-creative dependency. Treat it accordingly.

Here's how to stop your LLM integrations from collapsing the moment real users touch them.

The naive pattern that always fails

Most broken LLM integrations I've debugged look something like this:

python
import llm_client  # any LLM SDK

def summarize(text: str) -> str:
    response = llm_client.chat.completions.create(
        model="some-model",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

It works on a clean input in dev. Then in prod, you discover:

  • The provider returns a 529 because they're overloaded
  • A user pastes 200k tokens and you blow the context window
  • The network blips for 30 seconds and your request hangs forever
  • The model returns a polite refusal instead of a summary
  • The model returns valid prose, but downstream code expected JSON

None of this is a bug in the model. It's the difference between calling code you control and calling a remote, probabilistic black box.

Root cause: LLM calls are I/O, not pure functions

The mental model that breaks people is treating summarize(text) as a pure function. It's not. It's a remote call to a contended service, and what comes back is freeform text you then have to parse.

Three failure classes you must design for from day one:

  • Transport failures: timeouts, 5xx, rate limits, connection resets
  • Output failures: wrong format, truncated output, hallucinated fields
  • Semantic failures: technically valid output that's wrong for your use case

You'd never write an HTTP call to a flaky third-party API without retries, timeouts, and a circuit breaker. LLM calls deserve at least the same hygiene, plus structured-output validation on top.
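
Since the circuit breaker is the one piece of that hygiene the steps below don't show, here is a minimal sketch of the idea using only the standard library. The names CircuitBreaker and CircuitOpenError, the thresholds, and the injectable clock are all assumptions for illustration; a production breaker would also need thread safety and per-provider state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(
        self,
        max_failures: int = 5,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the provider
                raise CircuitOpenError("LLM provider circuit is open")
            # Cooldown elapsed: allow one trial call (half-open state).
            # A single failure from here re-opens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Any success resets the consecutive-failure count
        self._failures = 0
        return result
```

You would wrap the provider call, for example breaker.call(lambda: call_llm(messages)), and treat CircuitOpenError as a signal to serve the degraded path immediately.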

Step 1: Bound your calls properly

Set explicit timeouts. Always. The default for many clients is "wait basically forever," which becomes a head-of-line blocking disaster the moment upstream slows down.

python
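import httpx
from tenacity import (
    retry,
    retry_if_exception,
    stop_after_attempt,
    wait_exponential_jitter,
)

# Use httpx with strict, separate timeouts for each phase
http_client = httpx.Client(
    timeout=httpx.Timeout(
        connect=5.0,   # fail fast on connection
        read=30.0,     # don't wait forever for tokens
        write=10.0,
        pool=5.0,
    )
)

def is_transient(exc: BaseException) -> bool:
    # Retry 429s, 5xx, and network-level errors; other 4xx won't improve on retry
    if isinstance(exc, httpx.HTTPStatusError):
        status = exc.response.status_code
        return status == 429 or status >= 500
    return isinstance(exc, httpx.TransportError)

@retry(
    retry=retry_if_exception(is_transient),
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10),
    reraise=True,
)
def call_llm(messages: list[dict]) -> str:
    resp = http_client.post(
        "https://api.example.com/v1/chat/completions",
        json={"model": "some-model", "messages": messages},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]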

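Two things are doing the work here. The httpx.Timeout turns a stalled connection into an exception instead of a wedged worker. Note that the read timeout bounds the wait for each chunk of the response, not the total request time, so a slow trickle of data can still run long. The tenacity retry with jittered exponential backoff handles transient 429s and 5xx without hammering the provider when it's already on fire.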

tenacity is small, boring, and battle-tested. Docs at https://tenacity.readthedocs.io/ if you haven't used it before.
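
If "jittered exponential backoff" is new to you, the sketch below shows the general shape of the delay schedule using only the standard library. It is an illustration of the technique, not tenacity's source; the function name backoff_delays and its parameters are assumptions chosen to mirror the initial=1, max=10 settings above.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    initial: float = 1.0,
    cap: float = 10.0,
    jitter: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the waits between attempts (one fewer than the attempt count)."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts - 1):
        # Double the base wait each time, add a random offset so that many
        # clients failing together don't all retry at the same instant,
        # and never wait longer than the cap.
        raw = initial * (2 ** n) + rng.uniform(0, jitter)
        delays.append(min(raw, cap))
    return delays
```

With jitter set to zero the schedule for six attempts is 1, 2, 4, 8, 10 seconds: the doubling is visible, and the last wait is capped. The random offset is what keeps a fleet of workers from retrying in lockstep.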

Step 2: Validate the output, don't trust it

If you ask the model for JSON in a prompt, you'll get JSON about 97% of the time. The other 3% will take down your endpoint at 2am.

Pin the output shape with a schema and parse it explicitly. Pydantic is the path of least resistance in Python:

python
import logging

from pydantic import BaseModel, ValidationError

log = logging.getLogger(__name__)

class SummaryResult(BaseModel):
    summary: str
    key_points: list[str]
    confidence: float

def parse_summary(raw: str) -> SummaryResult | None:
    # Models sometimes wrap JSON in markdown fences: strip the backticks
    # first, then the "json" language tag they leave behind.
    cleaned = raw.strip().strip("`").removeprefix("json").strip()
    try:
        # In Pydantic v2, malformed JSON also surfaces as ValidationError
        return SummaryResult.model_validate_json(cleaned)
    except ValidationError as e:
        # Log the raw output; you'll want this in your eval set later
        log.warning("llm.parse_failed raw=%r error=%s", raw, e)
        return None

Two patterns I now reach for every time:

  • If the provider supports a structured-output or JSON mode, use it. It dramatically reduces format drift.
  • Keep the parsing function separate from the call. Much easier to unit-test against captured failures.
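
To make the second point concrete, here is a sketch of a standalone extraction helper plus the kind of captured-output cases you might test it against. It uses only the standard library; the name extract_json_object and the sample strings are assumptions for illustration, not output from any particular model.

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(raw: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded in raw, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it
            value, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON from this brace: try the next one
        start = raw.find("{", start + 1)
    return None


# Hypothetical captured outputs: each failure you log becomes a test case
CAPTURED = [
    ('{"summary": "ok"}', {"summary": "ok"}),
    ('Sure! Here it is:\n```json\n{"summary": "ok"}\n```', {"summary": "ok"}),
    ("I cannot help with that.", None),
    ('{"summary": "trunc', None),
]
```

Because the helper takes a string and returns a dict or None, it can be tested without any network access, and the resulting dict can then be handed to a schema check such as SummaryResult.model_validate.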

When parsing fails, decide your fallback per use case. Sometimes you retry with a stricter prompt. Sometimes you degrade gracefully (return a "couldn't summarize" state). Never silently 500. See the Pydantic docs at https://docs.pydantic.dev/ for the schema features.
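
One way to wire up the "retry with a stricter prompt, then degrade" fallback is sketched below. The function name call_with_parse_fallback and the wording of STRICT_SUFFIX are hypothetical, and the call and parse arguments stand in for whatever client and parser you already have.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical wording; tune it for your model and prompt
STRICT_SUFFIX = "\n\nRespond with a single JSON object and nothing else."


def call_with_parse_fallback(
    prompt: str,
    call: Callable[[str], str],
    parse: Callable[[str], Optional[T]],
    max_attempts: int = 2,
) -> Optional[T]:
    """Try the plain prompt first, then a stricter one; None means degrade."""
    current = prompt
    for _ in range(max_attempts):
        parsed = parse(call(current))
        if parsed is not None:
            return parsed
        # Parsing failed: make the format requirement explicit and try again
        current = prompt + STRICT_SUFFIX
    return None
```

The caller maps a None result to an explicit "couldn't summarize" state, so the failure is visible to the user and to your metrics instead of surfacing as an unhandled exception.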

Step 3: Make it observable

You cannot fix what you cannot see. The minimum I now ship with any LLM-touching endpoint:

  • Request and response body (with PII redaction) sampled to logs
  • Latency histograms split by model and prompt template
  • Token counts in and out per request
  • Counters for parse-failure, refusal, and retry-exhausted events
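
For the first item, a rough sketch of sampled, redacted logging with the standard library might look like this. The two regexes only catch email addresses and long digit runs, which is an assumption made to keep the example short; real PII redaction needs far more than this. The names redact and log_sampled and the 5% default sample rate are also hypothetical.

```python
import logging
import random
import re
from typing import Callable

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS_RE = re.compile(r"\b\d{9,}\b")  # phone, card, and account numbers


def redact(text: str) -> str:
    """Mask the most obvious identifiers before text reaches the logs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return LONG_DIGITS_RE.sub("[NUMBER]", text)


def log_sampled(
    log: logging.Logger,
    prompt: str,
    response: str,
    sample_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Log a redacted request/response pair for a fraction of calls."""
    if rng() >= sample_rate:
        return False
    log.info("llm.exchange prompt=%r response=%r", redact(prompt), redact(response))
    return True
```

Sampling keeps log volume and cost bounded, and redacting before the log call means raw user text never reaches the logging pipeline.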

OpenTelemetry covers all of this with very little code. The semantic conventions for GenAI calls are still settling, but you can use generic spans today and tag them with model name, prompt template ID, and result status.

python
from opentelemetry import trace

tracer = trace.get_tracer("llm")

def summarize(text: str) -> SummaryResult | None:
    with tracer.start_as_current_span("llm.summarize") as span:
        span.set_attribute("llm.model", "some-model")
        span.set_attribute("llm.input_chars", len(text))
        raw = call_llm([{"role": "user", "content": text}])
        result = parse_summary(raw)
        span.set_attribute("llm.parse_ok", result is not None)
        return result

When something goes wrong in prod, you can pivot on llm.parse_ok = false and find every offending input in seconds. That feedback loop is how you build an eval set instead of a graveyard of customer tickets. OpenTelemetry docs at https://opentelemetry.io/docs/.
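
The parse-failure, refusal, and retry-exhausted counters from the list above can share one small classifier. The sketch below is standard library only. The Outcome names, the classify function, and especially the REFUSAL_MARKERS phrases are assumptions; a prefix check is a crude heuristic, so tune it against your own captured refusals. It also assumes the caller passes raw=None when the call raised after its final retry.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    OK = "ok"
    PARSE_FAILED = "parse_failed"
    REFUSAL = "refusal"
    RETRY_EXHAUSTED = "retry_exhausted"


# Assumed refusal openers; replace with phrases seen in your own logs
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def classify(raw: Optional[str], parsed_ok: bool) -> Outcome:
    """Map one call's result to the event you want to count."""
    if raw is None:
        # No output at all: the call failed even after retries
        return Outcome.RETRY_EXHAUSTED
    if parsed_ok:
        return Outcome.OK
    if raw.strip().lower().startswith(REFUSAL_MARKERS):
        return Outcome.REFUSAL
    return Outcome.PARSE_FAILED


# In-process tally; in production this would feed your metrics backend
outcome_counts: Counter = Counter()


def record(raw: Optional[str], parsed_ok: bool) -> Outcome:
    outcome = classify(raw, parsed_ok)
    outcome_counts[outcome.value] += 1
    return outcome
```

The outcome value also works as a span attribute, which lets you separate refusals from plain format drift when you filter traces.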

Prevention: build the harness before the feature

Lessons I keep relearning the hard way:

  • Capture every failed output to a dataset. That dataset becomes your regression test.
  • Write a smoke test that exercises the full pipeline against a recorded fixture, not the live API. Libraries like respx or vcrpy let you replay HTTP deterministically.
  • Set a hard budget per request — both time and tokens — and enforce it in code, not by hoping prompts stay short.
  • Treat prompt files as code. Version them, review them in PRs, and tag spans with the prompt version so you can correlate quality regressions to specific changes.
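
As an illustration of the budget point, here is a sketch of enforcing both limits in code with the standard library. The four-characters-per-token estimate is a rough assumption for English text; use your provider's tokenizer when the limit really matters. The names BudgetExceeded, estimate_tokens, check_input_budget, and Deadline are hypothetical.

```python
import time
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a request would exceed its token or time budget."""


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token for English text
    return max(1, len(text) // 4)


def check_input_budget(text: str, max_input_tokens: int) -> int:
    """Reject oversized inputs before spending a network call on them."""
    estimated = estimate_tokens(text)
    if estimated > max_input_tokens:
        raise BudgetExceeded(
            f"input is ~{estimated} tokens, budget is {max_input_tokens}"
        )
    return estimated


class Deadline:
    """One time budget shared by every attempt of a single request."""

    def __init__(self, seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + seconds

    def remaining(self) -> float:
        left = self._expires_at - self._clock()
        if left <= 0:
            raise BudgetExceeded("time budget exhausted")
        return left
```

Calling check_input_budget at the top of the handler turns the 200k-token paste into a clean rejection, and passing deadline.remaining() as the timeout for each attempt keeps retries from stretching one request past its total budget.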

The slogan from the introduction isn't a hot take; it's an engineering instruction. AI is a technology. That means it gets the same retries, timeouts, validation, observability, and tests as every other piece of plumbing in your stack. The teams that internalize this ship features that survive contact with real users. The ones that don't keep paging themselves at 2am.
