Authon Blog
debugging · 7 min read

Why Your LLM Classification Pipeline Fails on Edge Cases (and How to Fix It)

How to build reliable LLM classification pipelines for high-stakes decisions — fixing confidence calibration, output validation, and human escalation.

Alan West
Authon Team

A Harvard study recently made waves: OpenAI's o1 model reportedly diagnosed 67% of emergency room patients correctly, compared to 50-55% accuracy from triage doctors. Whether or not that number holds up under scrutiny, it highlights something developers building AI classification systems already know — LLMs can be surprisingly good at pattern matching across messy, unstructured input.

But here's the part nobody's tweeting about: getting an LLM to perform well in a research setting and getting it to perform reliably in a production pipeline are two completely different problems.

I've spent the last year building classification systems that use LLMs for intake processing, risk scoring, and routing decisions. The accuracy numbers looked great in testing. Then production traffic hit, and things got weird fast.

Let me walk you through the failure modes I encountered and how I fixed each one.

The Core Problem: Inconsistent Output on Ambiguous Input

Here's the scenario. You've got an LLM classifying incoming data into categories — could be support tickets, insurance claims, medical symptoms, whatever. Your eval set shows 85% accuracy. You ship it.

Within a week, you notice:

  • The same input produces different classifications on retry
  • Edge cases get confidently wrong answers (no hedging, no uncertainty)
  • The model hallucinates categories that don't exist in your schema

Sound familiar? The root cause is almost always the same: you're treating a probabilistic text generator like a deterministic function.

Step 1: Lock Down Your Output Schema

The first fix is embarrassingly simple. Stop accepting free-text classification output.

python
import json
from pydantic import BaseModel, Field
from enum import Enum

class TriageCategory(str, Enum):
    CRITICAL = "critical"
    URGENT = "urgent"
    STANDARD = "standard"
    LOW = "low"

class ClassificationResult(BaseModel):
    category: TriageCategory
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(max_length=500)
    # Forces the model to flag when it's unsure
    ambiguous: bool = False
    differential: list[TriageCategory] = []  # other possible categories

class ClassificationError(Exception):
    """Raised when model output can't be parsed into the schema."""

def validate_classification(raw_output: str) -> ClassificationResult:
    try:
        data = json.loads(raw_output)
        return ClassificationResult(**data)
    except (json.JSONDecodeError, ValueError) as e:
        # Don't silently fall back — route to human review
        raise ClassificationError(f"Model output failed validation: {e}") from e

The differential field is the key insight I stole from actual medical practice. When doctors aren't sure, they don't just pick one answer — they list the possibilities. Your model should do the same.

If you're using an API that supports structured outputs or function calling, use that instead of parsing raw text. It eliminates an entire class of formatting errors.
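For example, here's a minimal sketch using the OpenAI Python SDK's parse helper, which accepts the Pydantic model directly (the prompt and model name are placeholders; other providers offer equivalents):

python
from openai import OpenAI

client = OpenAI()

def classify(text: str) -> ClassificationResult:
    # The SDK validates the response against the Pydantic schema for you
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify this item. Flag ambiguity honestly."},
            {"role": "user", "content": text},
        ],
        response_format=ClassificationResult,
    )
    return completion.choices[0].message.parsed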

Step 2: Calibrate Confidence Scores (They're Lying to You)

Here's something that bit me hard. When you ask an LLM to self-report confidence, those numbers are essentially made up. A model that says it's 95% confident is not actually right 95% of the time.

python
import numpy as np
from collections import defaultdict

class ConfidenceCalibrator:
    """Post-hoc calibration using historical predictions vs. outcomes."""
    
    def __init__(self, n_bins: int = 10):
        self.n_bins = n_bins
        self.bin_boundaries = np.linspace(0, 1, n_bins + 1)
        self.calibration_map: dict[int, float] = {}
    
    def fit(self, predicted_confidences: list[float], actual_correct: list[bool]):
        """Build calibration curve from labeled evaluation data."""
        bins = defaultdict(list)
        
        for conf, correct in zip(predicted_confidences, actual_correct):
            bin_idx = int(np.digitize(conf, self.bin_boundaries)) - 1
            bin_idx = min(bin_idx, self.n_bins - 1)
            bins[bin_idx].append(correct)
        
        for bin_idx, outcomes in bins.items():
            # Actual accuracy for this confidence range
            self.calibration_map[bin_idx] = sum(outcomes) / len(outcomes)
    
    def calibrate(self, raw_confidence: float) -> float:
        """Map model's claimed confidence to actual observed accuracy."""
        bin_idx = int(np.digitize(raw_confidence, self.bin_boundaries)) - 1
        bin_idx = min(bin_idx, self.n_bins - 1)
        return self.calibration_map.get(bin_idx, raw_confidence)

In my experience, LLMs are consistently overconfident in the 0.7-0.9 range. After calibration, a lot of those "85% confident" predictions turned out to be correct about 60% of the time. That's a massive difference when you're routing decisions based on those numbers.
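Fitting the calibrator is cheap once you have labeled eval data. A sketch of the workflow, assuming eval_cases is a list of (input, expected_category) pairs and classify is the function from the previous step:

python
predictions = [classify(text) for text, _ in eval_cases]
confidences = [p.confidence for p in predictions]
correct = [p.category == expected for p, (_, expected) in zip(predictions, eval_cases)]

calibrator = ConfidenceCalibrator(n_bins=10)
calibrator.fit(confidences, correct)

# Maps the model's claimed 0.85 to whatever accuracy you actually observed
calibrated = calibrator.calibrate(0.85)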

Step 3: Build a Human-in-the-Loop Escalation Path

This is where most teams cut corners, and it's where the Harvard study comparison gets interesting. The study compared AI-only vs. doctor-only. But in practice, the winning architecture is neither — it's AI + human with clear escalation rules.

python
class EscalationRouter:
    def __init__(self, calibrator: ConfidenceCalibrator, 
                 auto_threshold: float = 0.85,
                 reject_threshold: float = 0.5):
        self.calibrator = calibrator
        self.auto_threshold = auto_threshold
        self.reject_threshold = reject_threshold
    
    def route(self, result: ClassificationResult) -> str:
        calibrated = self.calibrator.calibrate(result.confidence)
        
        # High confidence + no ambiguity = auto-process
        if calibrated >= self.auto_threshold and not result.ambiguous:
            return "auto_accept"
        
        # Model flagged ambiguity or differential has close alternatives
        if result.ambiguous or len(result.differential) > 1:
            return "human_review_priority"
        
        # Low confidence = don't even try
        if calibrated < self.reject_threshold:
            return "human_review_required"
        
        # Middle ground: accept but flag for async audit
        return "auto_accept_with_audit"

The auto_accept_with_audit path is crucial. It lets you process the majority of clear-cut cases automatically while building a feedback dataset from the audited ones. After a few weeks, you've got labeled data to retrain your calibration curve.
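Here's a hedged sketch of how the pieces fit together (process, audit_queue, and human_review_queue are hypothetical stand-ins for your own infrastructure):

python
router = EscalationRouter(calibrator, auto_threshold=0.85, reject_threshold=0.5)

def handle(item: str) -> None:
    result = classify(item)  # structured-output call from Step 1
    decision = router.route(result)

    if decision == "auto_accept":
        process(item, result.category)
    elif decision == "auto_accept_with_audit":
        process(item, result.category)
        audit_queue.append((item, result))  # humans label these async
    else:
        # Both human_review_priority and human_review_required land here
        human_review_queue.append((item, result, decision))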

Step 4: Use Eval-Driven Development, Not Vibes

The reason that Harvard study is useful isn't the headline number — it's that they had a clear evaluation methodology. Your classification system needs the same thing.

python
from collections import Counter

def run_eval_suite(classify_fn, test_cases: list[dict]) -> dict:
    results = {
        "total": len(test_cases),
        "correct": 0,
        "incorrect_but_flagged": 0,  # wrong, but model said ambiguous
        "incorrect_confident": 0,    # wrong AND confident — the scary ones
        "consistency": [],           # same input, multiple runs
    }

    for case in test_cases:
        # Run each case 3 times to check consistency
        outputs = [classify_fn(case["input"]) for _ in range(3)]
        categories = [o.category for o in outputs]

        results["consistency"].append(len(set(categories)) == 1)

        # Use majority vote for accuracy check
        majority = Counter(categories).most_common(1)[0][0]

        if majority == case["expected"]:
            results["correct"] += 1
        elif any(o.ambiguous for o in outputs):
            results["incorrect_but_flagged"] += 1
        else:
            results["incorrect_confident"] += 1

    results["consistency_rate"] = sum(results["consistency"]) / len(results["consistency"])
    return results

The metric I care about most isn't overall accuracy — it's incorrect_confident. That's the failure mode that causes real damage. A system that's wrong 20% of the time but flags uncertainty is infinitely more useful than one that's wrong 15% of the time but never tells you.
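That metric translates directly into a CI gate. A sketch, with thresholds picked purely for illustration:

python
report = run_eval_suite(classify, test_cases)

# Tune these against your own risk tolerance; they are not recommendations
assert report["consistency_rate"] >= 0.95, "outputs unstable across retries"
assert report["incorrect_confident"] / report["total"] <= 0.02, \
    "too many confidently wrong answers"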

Prevention: The Production Checklist

Before you ship any LLM classification pipeline to production:

  • Structured output validation — never trust raw text parsing for critical paths
  • Calibrated confidence — run at least 200 labeled examples through calibration before going live
  • Escalation routing — define explicit thresholds for auto-accept, audit, and human-review
  • Consistency testing — if the same input gives different outputs on retry, your temperature is too high or your prompt is ambiguous
  • Eval suite in CI — run your test cases on every prompt change, every model version bump
  • Monitoring in production — track confidence distribution drift over time. If your model suddenly gets more confident or less confident across the board, something changed (one way to check is sketched below)
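One simple way to implement that last check (not the only way) is a two-sample Kolmogorov-Smirnov test between a baseline window of confidences and recent traffic:

python
from scipy import stats

def confidence_drifted(baseline: list[float], recent: list[float],
                       alpha: float = 0.01) -> bool:
    """Flag drift when recent confidences no longer match the baseline."""
    _, p_value = stats.ks_2samp(baseline, recent)
    return p_value < alpha  # small p-value = distributions likely differ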

The Bigger Picture

The headline "AI beats doctors" is reductive. What the research actually suggests is that LLMs are good at synthesizing patterns across large amounts of unstructured text — which is literally what they were built to do.

The developer takeaway isn't "replace humans with LLMs." It's that a well-built classification pipeline with proper calibration, structured outputs, and human escalation can outperform either humans or AI working alone.

Build the pipeline right, measure it honestly, and don't trust the confidence scores until you've calibrated them. That's it. That's the whole thing.
