You wake up, check Hacker News, and see the announcement: a new flagship model just dropped. Better benchmarks, new capabilities, maybe even a different pricing tier. Your first instinct is to swap the model ID string and ship it. I've done this. It broke things.
After migrating production LLM integrations across multiple model upgrades — including the recent Claude model family updates — I've learned that "just change the model string" is a recipe for subtle, hard-to-debug regressions. Here's why it fails and how to do it properly.
## Why Swapping the Model ID Breaks Things
The core issue is that your prompts were tuned for a specific model's behavior. Even when a new model is strictly "better" on benchmarks, three things commonly shift:
- Output formatting changes. A model that used to return clean JSON might now add markdown fences or explanatory text around it. Your parser chokes.
- Instruction sensitivity shifts. Prompts that needed heavy-handed repetition on the old model might cause the new one to over-comply. Prompts that relied on implicit behavior might get interpreted differently.
- Token usage and latency profiles change. A model that's smarter per token might use more tokens to be thorough, blowing past your `max_tokens` budget.
I hit all three of these in a single migration last year. The JSON parsing failures were the loudest. The subtle prompt behavior shifts were the most dangerous — they passed tests but produced worse results in edge cases.
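Of the three, the token-budget shift is the easiest to catch automatically: the API tells you when a response was cut off. A minimal guardrail, sketched with the Anthropic Python SDK (the function name and logging choice are mine):

```python
import logging

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def call_with_truncation_check(prompt: str, model: str, max_tokens: int = 1024) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    # stop_reason == "max_tokens" means the model ran out of budget mid-answer,
    # exactly what happens when a new model is more verbose per request
    if resp.stop_reason == "max_tokens":
        logging.warning("Truncated response from %s (max_tokens=%d)", model, max_tokens)
    return resp.content[0].text
```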
## Step 1: Set Up a Shadow Pipeline
Before touching production, run the new model in parallel. This doesn't need to be fancy:
```python
import asyncio
import hashlib
from datetime import datetime, timezone


async def shadow_compare(prompt: str, client, old_model: str, new_model: str):
    # `client` should be an anthropic.AsyncAnthropic instance so both calls can overlap.
    # Run both models concurrently — don't pay the latency cost twice
    old_task = client.messages.create(
        model=old_model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    new_task = client.messages.create(
        model=new_model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    old_resp, new_resp = await asyncio.gather(old_task, new_task)

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Stable across processes, unlike the built-in hash()
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "old_output": old_resp.content[0].text,
        "new_output": new_resp.content[0].text,
        "old_tokens": old_resp.usage.output_tokens,
        "new_tokens": new_resp.usage.output_tokens,
    }
```

Log those results. You're building a dataset of behavioral differences, and you want it before you start changing prompts.
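To actually populate that dataset, I sample a slice of real traffic and append each comparison to a JSONL file. A sketch under a few assumptions of my own: the 5% sampling rate, the file path, and the model IDs are placeholders, and `client` is an `anthropic.AsyncAnthropic` instance.

```python
import json
import random

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment


async def handle_request(prompt: str) -> str:
    # Shadow-compare ~5% of traffic; the user still gets the old model's answer
    if random.random() < 0.05:
        result = await shadow_compare(prompt, client, "old-model-id", "new-model-id")
        with open("shadow_log.jsonl", "a") as f:
            f.write(json.dumps(result) + "\n")
        return result["old_output"]

    resp = await client.messages.create(
        model="old-model-id",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text
```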
## Step 2: Build an Eval Suite (Yes, Actually)
I know, I know. "Evals" sounds like something you'll get to later. But you don't need a massive framework. A simple assertion-based test file works:
```python
import json

import pytest

# Your real production prompts with known-good outputs
TEST_CASES = [
    {
        "prompt": "Extract the date from: 'Meeting scheduled for March 15th, 2026'",
        "must_contain": ["2026-03-15"],   # Expected format
        "must_not_contain": ["```"],      # No markdown fences around the answer
    },
    # ... more cases, including ones that set "valid_json": True
]


@pytest.mark.parametrize("case", TEST_CASES)
def test_model_output(case, model_response):
    output = model_response(case["prompt"])

    for expected in case.get("must_contain", []):
        assert expected.lower() in output.lower(), f"Missing: {expected}"

    for forbidden in case.get("must_not_contain", []):
        assert forbidden not in output, f"Found forbidden: {forbidden}"

    if case.get("valid_json"):
        # This is where migrations break most often
        stripped = output.strip().removeprefix("```json").removesuffix("```").strip()
        json.loads(stripped)
```
The `must_not_contain` and JSON checks are doing the heavy lifting here. These are the exact failure modes that bite you during model migrations.
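The test leans on a `model_response` fixture that isn't shown above. One way to wire it up in `conftest.py`, with the model ID pulled from an environment variable so the same suite can run against either model (the variable name is an assumption, not an established convention):

```python
# conftest.py
import os

import pytest
from anthropic import Anthropic


@pytest.fixture
def model_response():
    client = Anthropic()
    # LLM_MODEL_UNDER_TEST is a made-up name; point it at the old or new model ID
    model = os.environ["LLM_MODEL_UNDER_TEST"]

    def _call(prompt: str) -> str:
        resp = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

    return _call
```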
## Step 3: Fix Your Output Parsing (It's Probably Fragile)
Most migration breakage comes from brittle parsing. The fix isn't to keep prompting harder — it's to make your parser resilient:
```python
import json
import re


def extract_json_robust(text: str) -> dict:
    """Parse JSON from LLM output regardless of wrapping."""
    # Try direct parse first
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        pass

    # Strip markdown code fences (common with newer models)
    fence_pattern = r'```(?:json)?\s*\n?(.*?)\n?\s*```'
    match = re.search(fence_pattern, text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            pass

    # Last resort: find the first { ... } or [ ... ] block
    for start_char, end_char in [('{', '}'), ('[', ']')]:
        start = text.find(start_char)
        end = text.rfind(end_char)
        if start != -1 and end != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                continue

    raise ValueError(f"No valid JSON found in response: {text[:200]}...")
```
I've shipped variations of this function in four different projects now. It handles 99% of the formatting drift you'll see between model versions.
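A quick way to convince yourself it covers the common wrappings (these inputs are made up for illustration, not captured model outputs):

```python
samples = [
    '{"date": "2026-03-15"}',                                # bare JSON
    'Here you go:\n```json\n{"date": "2026-03-15"}\n```',    # fenced, with preamble
    'The extracted date is below.\n{"date": "2026-03-15"}',  # prose, then JSON
]

for sample in samples:
    assert extract_json_robust(sample) == {"date": "2026-03-15"}
```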
## Step 4: Migrate Prompts Incrementally
Don't rewrite all your prompts at once. Newer models are generally better at following instructions, which means you can often *simplify* prompts. But do it one at a time:
1. **Run your eval suite against the new model with your existing prompts.** Note which ones fail.
2. **Fix the failures first.** Usually this means tightening output format instructions or removing workarounds that the old model needed.
3. **Then simplify the passing prompts.** Remove redundant instruction repetition. Newer models tend to need less hand-holding (see the sketch after this list).
4. **Re-run evals after each change.** This sounds tedious. It is. It's also way less tedious than debugging production issues.
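What step 3 looks like in practice, with illustrative prompts rather than anything from a real codebase. The eval suite is what tells you whether the shorter version holds up.

```python
# Old model needed the output format stated three different ways
PROMPT_V1 = """Extract the meeting date from the text below.
Respond with ONLY a JSON object. Do not add any explanation.
Do not use markdown formatting. Remember: JSON only, nothing else.
Format: {"date": "YYYY-MM-DD"}

Text: """

# Newer model follows a single, plainly stated instruction
PROMPT_V2 = """Extract the meeting date from the text below.
Respond with a JSON object: {"date": "YYYY-MM-DD"}

Text: """
```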
## Step 5: Use Feature Flags for the Rollout
Environment variables work fine for this. Don't overthink it:

```bash
# In your .env or deployment config
LLM_MODEL_PRIMARY=claude-sonnet-4-5-20241022
LLM_MODEL_CANARY=claude-opus-4-6
LLM_CANARY_PERCENTAGE=10
```
Route a small percentage of traffic to the new model, monitor your error rates and output quality metrics, then ramp up. If something goes sideways, you flip one variable and you're back on the old model.
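The routing itself is a few lines. A sketch that reads the variables above and buckets requests deterministically by user ID, so a given user stays on the same model throughout the canary (the function name is mine, not from any framework):

```python
import hashlib
import os


def pick_model(user_id: str) -> str:
    primary = os.environ["LLM_MODEL_PRIMARY"]
    canary = os.environ.get("LLM_MODEL_CANARY")
    percentage = int(os.environ.get("LLM_CANARY_PERCENTAGE", "0"))

    if not canary or percentage <= 0:
        return primary

    # Deterministic bucket per user: the same user stays on the same model,
    # which makes quality comparisons and debugging much easier
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < percentage else primary
```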
## Prevention: Build Migration-Friendly Integrations
After going through this enough times, I've settled on a few patterns that make future migrations painless:
- Never hardcode model IDs. Always pull from config.
- Always parse LLM output defensively. Assume the format will drift.
- Keep an eval suite that runs in CI. Even 10-20 test cases catch most regressions.
- Log raw model outputs in production (with appropriate PII handling). When something breaks, you want to see exactly what the model said, not just your downstream error.
- Abstract your LLM client behind an interface. Not for "provider portability" (that's usually YAGNI), but so you have one place to add retry logic, logging, and model routing (a rough sketch follows below).
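For what that wrapper might look like, here's a rough sketch combining several of the bullets above: config-driven model choice, logging of raw output, and a simple retry. The class and method names are illustrative, not a prescribed design.

```python
import logging
import os

from anthropic import Anthropic, APIError


class LLMClient:
    def __init__(self) -> None:
        self._client = Anthropic()
        self._model = os.environ["LLM_MODEL_PRIMARY"]  # never hardcoded

    def complete(self, prompt: str, max_tokens: int = 1024) -> str:
        for attempt in range(3):  # simple retry on API errors
            try:
                resp = self._client.messages.create(
                    model=self._model,
                    max_tokens=max_tokens,
                    messages=[{"role": "user", "content": prompt}],
                )
                text = resp.content[0].text
                # Log the raw output so migration debugging starts from what the
                # model actually said (scrub PII here if that applies to you)
                logging.info("llm_output model=%s text=%r", self._model, text)
                return text
            except APIError as exc:
                logging.warning("LLM call failed (attempt %d): %s", attempt + 1, exc)
        raise RuntimeError("LLM call failed after 3 attempts")
```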
## The Real Lesson
Model upgrades aren't like library upgrades. There's no changelog that says "removed support for X output format." The changes are probabilistic — things that worked 99% of the time might now work 97% of the time, or 100% of the time but in a slightly different format.
Treat every model migration like a dependency upgrade that touches every function in your codebase. Because, in a way, it does.
The good news: each migration gets easier if you build the right infrastructure. The shadow pipeline, the eval suite, the defensive parsing — you set these up once, and every future model upgrade goes from "terrifying production incident" to "boring Tuesday deployment." And boring deployments are the best kind.
