A Harvard study recently made waves: OpenAI's o1 model reportedly diagnosed 67% of emergency room patients correctly, compared to 50-55% accuracy from triage doctors. Whether or not that number holds up under scrutiny, it highlights something developers building AI classification systems already know — LLMs can be surprisingly good at pattern matching across messy, unstructured input.
But here's the part nobody's tweeting about: getting an LLM to perform well in a research setting and getting it to perform reliably in a production pipeline are two completely different problems.
I've spent the last year building classification systems that use LLMs for intake processing, risk scoring, and routing decisions. The accuracy numbers looked great in testing. Then production traffic hit, and things got weird fast.
Let me walk you through the failure modes I encountered and how I fixed each one.
The Core Problem: Inconsistent Output on Ambiguous Input
Here's the scenario. You've got an LLM classifying incoming data into categories — could be support tickets, insurance claims, medical symptoms, whatever. Your eval set shows 85% accuracy. You ship it.
Within a week, you notice:
- The same input produces different classifications on retry
- Edge cases get confidently wrong answers (no hedging, no uncertainty)
- The model hallucinates categories that don't exist in your schema
Sound familiar? The root cause is almost always the same: you're treating a probabilistic text generator like a deterministic function.
Step 1: Lock Down Your Output Schema
The first fix is embarrassingly simple. Stop accepting free-text classification output.
```python
import json
from enum import Enum

from pydantic import BaseModel, Field


class ClassificationError(Exception):
    """Model output that couldn't be parsed into the schema."""


class TriageCategory(str, Enum):
    CRITICAL = "critical"
    URGENT = "urgent"
    STANDARD = "standard"
    LOW = "low"


class ClassificationResult(BaseModel):
    category: TriageCategory
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(max_length=500)
    # Forces the model to flag when it's unsure
    ambiguous: bool = False
    differential: list[TriageCategory] = []  # other possible categories


def validate_classification(raw_output: str) -> ClassificationResult:
    try:
        data = json.loads(raw_output)
        return ClassificationResult(**data)
    except (json.JSONDecodeError, ValueError) as e:
        # Don't silently fall back — route to human review
        raise ClassificationError(f"Model output failed validation: {e}") from e
```

The differential field is the key insight I stole from actual medical practice. When doctors aren't sure, they don't just pick one answer — they list the possibilities. Your model should do the same.
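To make the hallucinated-category failure mode concrete, here's a minimal repro (schema trimmed from the one above, assuming pydantic v2): an out-of-schema label fails validation instead of slipping through as a new category.

```python
import json
from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class TriageCategory(str, Enum):
    CRITICAL = "critical"
    URGENT = "urgent"
    STANDARD = "standard"
    LOW = "low"


class ClassificationResult(BaseModel):
    category: TriageCategory
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(max_length=500)


# A hallucinated category is rejected at the schema boundary
bad = '{"category": "super_urgent", "confidence": 0.9, "reasoning": "..."}'
try:
    ClassificationResult(**json.loads(bad))
    outcome = "accepted"
except ValidationError:
    outcome = "rejected"
print(outcome)  # rejected
```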
If you're using an API that supports structured outputs or function calling, use that instead of parsing raw text. It eliminates an entire class of formatting errors.
Step 2: Calibrate Confidence Scores (They're Lying to You)
Here's something that bit me hard. When you ask an LLM to self-report confidence, those numbers are essentially made up. A model that says it's 95% confident is not actually right 95% of the time.
```python
import numpy as np
from collections import defaultdict


class ConfidenceCalibrator:
    """Post-hoc calibration using historical predictions vs. outcomes."""

    def __init__(self, n_bins: int = 10):
        self.n_bins = n_bins
        self.bin_boundaries = np.linspace(0, 1, n_bins + 1)
        self.calibration_map: dict[int, float] = {}

    def fit(self, predicted_confidences: list[float], actual_correct: list[bool]):
        """Build calibration curve from labeled evaluation data."""
        bins = defaultdict(list)
        for conf, correct in zip(predicted_confidences, actual_correct):
            bin_idx = int(np.digitize(conf, self.bin_boundaries)) - 1
            bin_idx = min(bin_idx, self.n_bins - 1)
            bins[bin_idx].append(correct)
        for bin_idx, outcomes in bins.items():
            # Actual accuracy for this confidence range
            self.calibration_map[bin_idx] = sum(outcomes) / len(outcomes)

    def calibrate(self, raw_confidence: float) -> float:
        """Map model's claimed confidence to actual observed accuracy."""
        bin_idx = int(np.digitize(raw_confidence, self.bin_boundaries)) - 1
        bin_idx = min(bin_idx, self.n_bins - 1)
        return self.calibration_map.get(bin_idx, raw_confidence)
```

In my experience, LLMs are consistently overconfident in the 0.7-0.9 range. After calibration, a lot of those "85% confident" predictions turned out to be correct about 60% of the time. That's a massive difference when you're routing decisions based on those numbers.
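Before wiring the calibrator in, it helps to quantify how miscalibrated the raw scores actually are. Expected calibration error (ECE) is the standard summary: bin predictions by claimed confidence, then take the weighted average gap between claimed confidence and observed accuracy per bin. A minimal sketch (my own helper, not part of the pipeline above):

```python
import numpy as np


def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between claimed confidence and observed accuracy."""
    confs = np.asarray(confidences, dtype=float)
    hits = np.asarray(correct, dtype=float)
    boundaries = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        # Last bin includes its right edge so confidence 1.0 isn't dropped
        mask = (confs >= lo) & ((confs <= hi) if hi >= 1 else (confs < hi))
        if mask.any():
            gap = abs(hits[mask].mean() - confs[mask].mean())
            ece += (mask.sum() / len(confs)) * gap
    return float(ece)


# Four predictions, all claiming 0.9 confidence, but only half are right
print(round(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                       [True, True, False, False]), 3))  # 0.4
```

An ECE near zero means the self-reported scores are already trustworthy; in my pipelines it never was before calibration.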
Step 3: Build a Human-in-the-Loop Escalation Path
This is where most teams cut corners, and it's where the Harvard study comparison gets interesting. The study compared AI-only vs. doctor-only. But in practice, the winning architecture is neither — it's AI + human with clear escalation rules.
```python
class EscalationRouter:
    def __init__(self, calibrator: ConfidenceCalibrator,
                 auto_threshold: float = 0.85,
                 reject_threshold: float = 0.5):
        self.calibrator = calibrator
        self.auto_threshold = auto_threshold
        self.reject_threshold = reject_threshold

    def route(self, result: ClassificationResult) -> str:
        calibrated = self.calibrator.calibrate(result.confidence)
        # High confidence + no ambiguity = auto-process
        if calibrated >= self.auto_threshold and not result.ambiguous:
            return "auto_accept"
        # Model flagged ambiguity or differential has close alternatives
        if result.ambiguous or len(result.differential) > 1:
            return "human_review_priority"
        # Low confidence = don't even try
        if calibrated < self.reject_threshold:
            return "human_review_required"
        # Middle ground: accept but flag for async audit
        return "auto_accept_with_audit"
```

The auto_accept_with_audit path is crucial. It lets you process the majority of clear-cut cases automatically while building a feedback dataset from the audited ones. After a few weeks, you've got labeled data to retrain your calibration curve.
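How do you pick auto_threshold? I set it empirically rather than by gut feel: given labeled eval data with calibrated confidences, find the lowest cutoff whose auto-accepted slice stays under a target error rate. A sketch (the helper name and the 5% error budget are my own choices, not a standard API):

```python
def pick_auto_threshold(calibrated_confs: list[float],
                        correct: list[bool],
                        max_error_rate: float = 0.05) -> float:
    """Lowest confidence cutoff whose auto-accepted slice stays under the
    target error rate. Returns 1.0 if no cutoff qualifies (auto-accept nothing)."""
    best = 1.0
    for t in sorted(set(calibrated_confs), reverse=True):
        sliced = [ok for c, ok in zip(calibrated_confs, correct) if c >= t]
        error = 1 - sum(sliced) / len(sliced)
        if error <= max_error_rate:
            best = t
    return best


# With a 5% error budget, the slice at >= 0.8 is the lowest that qualifies
print(pick_auto_threshold([0.9, 0.9, 0.8, 0.7],
                          [True, True, True, False]))  # 0.8
```

The quadratic scan is fine at eval-set sizes; the point is that the threshold comes from measured error, not vibes.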
Step 4: Use Eval-Driven Development, Not Vibes
The reason that Harvard study is useful isn't the headline number — it's that they had a clear evaluation methodology. Your classification system needs the same thing.
```python
from collections import Counter


def run_eval_suite(classify_fn, test_cases: list[dict]) -> dict:
    results = {
        "total": len(test_cases),
        "correct": 0,
        "incorrect_but_flagged": 0,  # wrong, but model said ambiguous
        "incorrect_confident": 0,    # wrong AND confident — the scary ones
        "consistency": []            # same input, multiple runs
    }
    for case in test_cases:
        # Run each case 3 times to check consistency
        outputs = [classify_fn(case["input"]) for _ in range(3)]
        categories = [o.category for o in outputs]
        results["consistency"].append(len(set(categories)) == 1)
        # Use majority vote for accuracy check
        majority = Counter(categories).most_common(1)[0][0]
        if majority == case["expected"]:
            results["correct"] += 1
        elif any(o.ambiguous for o in outputs):
            results["incorrect_but_flagged"] += 1
        else:
            results["incorrect_confident"] += 1
    results["consistency_rate"] = sum(results["consistency"]) / len(results["consistency"])
    return results
```

The metric I care about most isn't overall accuracy — it's incorrect_confident. That's the failure mode that causes real damage. A system that's wrong 20% of the time but flags uncertainty is infinitely more useful than one that's wrong 15% of the time but never tells you.
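That sample-three-times trick deserves to graduate from eval into production: sample the classifier k times and only trust the answer when agreement is high (self-consistency voting). A small helper for the voting step; the 0.6 agreement cutoff is my own default, tune it to your risk tolerance:

```python
from collections import Counter


def majority_vote(labels: list[str], min_agreement: float = 0.6):
    """Return (winner, agreement); winner is None when agreement is too low."""
    winner, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return (winner if agreement >= min_agreement else None), agreement


label, agreement = majority_vote(["urgent", "urgent", "standard"])
print(label, round(agreement, 2))  # urgent 0.67
```

A None winner routes straight to the human-review path from Step 3: disagreement across samples is itself an ambiguity signal, often a stronger one than the model's self-reported confidence.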
Prevention: The Production Checklist
Before you ship any LLM classification pipeline to production:
- Structured output validation — never trust raw text parsing for critical paths
- Calibrated confidence — run at least 200 labeled examples through calibration before going live
- Escalation routing — define explicit thresholds for auto-accept, audit, and human-review
- Consistency testing — if the same input gives different outputs on retry, your temperature is too high or your prompt is ambiguous
- Eval suite in CI — run your test cases on every prompt change, every model version bump
- Monitoring in production — track confidence distribution drift over time. If your model suddenly gets more confident or less confident across the board, something changed
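That last checklist item can be automated with a population stability index (PSI) over confidence buckets, a standard drift metric borrowed from credit scoring. A sketch (the bucket count is my choice; a common rule of thumb treats PSI above 0.2 as actionable drift):

```python
import numpy as np


def confidence_psi(baseline: list[float], recent: list[float],
                   n_bins: int = 10, eps: float = 1e-4) -> float:
    """Population stability index between two confidence distributions.
    eps smooths empty buckets so the log term stays finite."""
    edges = np.linspace(0, 1, n_bins + 1)
    p = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps
    q = np.histogram(recent, bins=edges)[0] / len(recent) + eps
    return float(np.sum((p - q) * np.log(p / q)))


rng = np.random.default_rng(0)
base = rng.uniform(0.5, 0.9, 1000).tolist()
same = rng.uniform(0.5, 0.9, 1000).tolist()
shifted = rng.uniform(0.7, 1.0, 1000).tolist()  # model got more confident
print(confidence_psi(base, same) < 0.1)     # True — no drift
print(confidence_psi(base, shifted) > 0.2)  # True — alert
```

Run it on a rolling window against the confidence distribution from your calibration period, and page someone when it crosses the cutoff.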
The Bigger Picture
The headline "AI beats doctors" is reductive. What the research actually suggests is that LLMs are good at synthesizing patterns across large amounts of unstructured text — which is literally what they were built to do.
The developer takeaway isn't "replace humans with LLMs." It's that a well-built classification pipeline with proper calibration, structured outputs, and human escalation can outperform either humans or AI working alone.
Build the pipeline right, measure it honestly, and don't trust the confidence scores until you've calibrated them. That's it. That's the whole thing.
