You built a flight status dashboard. It looks great. Users love the UI. Then someone tweets a screenshot showing your app says their flight is "on time" while they're literally sitting on a delayed plane at JFK. Cool.
I've been there. Twice. The problem isn't your frontend, your caching layer, or your websocket implementation. It's that real-time airport and flight data is significantly harder to get right than most developers expect.
Let me walk through why this happens and how to build something that actually reflects reality.
The Root Cause: FAA Data Is Messier Than You Think
Most flight tracking projects start by pulling from the FAA's public data sources — things like the Airport Status API or SWIM (System Wide Information Management). The assumption is: government data source = authoritative = accurate.
Not quite.
FAA delay data has a few gotchas that will burn you:
- Ground Delay Programs (GDP) are reported at the program level, not per-flight. Your flight might be delayed 45 minutes due to a GDP, but the API won't tell you that directly.
- Status updates lag by 5-20 minutes depending on the data source and time of day. During peak hours at busy airports, the lag gets worse.
- Cancellations sometimes appear as delays first. A flight might show as "delayed 3 hours" before flipping to "cancelled" — and your app showed stale optimism the whole time.
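One defensive pattern that falls out of these quirks: normalize raw statuses and treat very long delays as a possible cancellation in disguise. Here's a minimal sketch — the function name and the 180-minute threshold are my own assumptions for illustration, not FAA semantics:

```python
def normalize_status(raw_status: str, delay_minutes: int,
                     cancel_threshold: int = 180) -> dict:
    """Map a raw upstream status into something safer to display.

    The 180-minute cutoff is an invented heuristic, not an FAA rule.
    """
    status = raw_status.strip().lower()
    if status == "cancelled":
        return {"status": "cancelled", "delay_minutes": None}
    if delay_minutes >= cancel_threshold:
        # very long delays often precede a cancellation, so surface
        # that possibility instead of showing stale optimism
        return {"status": "delayed_possible_cancellation",
                "delay_minutes": delay_minutes}
    if delay_minutes > 0:
        return {"status": "delayed", "delay_minutes": delay_minutes}
    return {"status": "on_time", "delay_minutes": 0}
```

Even if the upstream feed never says "cancelled" until the last minute, your UI can at least warn that a three-hour delay may not end in a departure.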
Step 1: Don't Rely on a Single Data Source
The first fix is triangulating across multiple feeds. Here's a basic architecture I've used:
```python
import asyncio
import aiohttp
from datetime import datetime, timedelta


class FlightStatusAggregator:
    def __init__(self):
        self.sources = [
            FAAAirportStatusSource(),
            ADSBExchangeSource(),   # ADS-B radio signals from actual aircraft
            AirlineAPISource(),     # some airlines expose semi-public APIs
        ]

    async def get_delay_info(self, airport_code: str) -> dict:
        # fetch from all sources concurrently
        tasks = [source.fetch(airport_code) for source in self.sources]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # filter out failed sources — don't let one bad API kill your data
        valid = [r for r in results if not isinstance(r, Exception)]
        if not valid:
            return {"status": "unknown", "confidence": 0.0}
        return self._reconcile(valid)

    def _reconcile(self, reports: list) -> dict:
        delays = [r.get("delay_minutes", 0) for r in reports]
        # take the WORST reported status — optimistic defaults burn users
        delay_minutes = max(delays)
        # confidence drops when sources disagree significantly
        spread = max(delays) - min(delays)
        confidence = max(0.3, 1.0 - (spread / 120))  # normalize against 2hr spread
        return {
            "delay_minutes": delay_minutes,
            "confidence": round(confidence, 2),
            "source_count": len(reports),
            "timestamp": datetime.utcnow().isoformat(),
        }
```

The key insight: always bias toward the worst-case report. Users will forgive you for saying a flight is delayed when it's actually on time. They will not forgive the opposite.
Step 2: Use ADS-B Data as Ground Truth
ADS-B (Automatic Dependent Surveillance-Broadcast) is the radio signal that aircraft transmit with their position, altitude, and speed. Open-source projects like dump1090 and networks like ADS-B Exchange aggregate this from hobbyist receivers worldwide.
This is the closest thing to ground truth you'll get without working at an airline.
```python
async def check_adsb_departure_status(flight_icao: str,
                                      scheduled_departure: datetime) -> dict:
    """Check if a flight has actually departed by looking at ADS-B signals."""
    async with aiohttp.ClientSession() as session:
        # query an ADS-B aggregator for recent positions
        async with session.get(
            f"https://your-adsb-source/api/aircraft/{flight_icao}"
        ) as resp:
            data = await resp.json()

    if not data.get("positions"):
        # no ADS-B signal — plane is likely still on the ground
        if datetime.utcnow() > scheduled_departure + timedelta(minutes=15):
            return {"status": "likely_delayed", "airborne": False}
        return {"status": "waiting", "airborne": False}

    latest = data["positions"][-1]
    # altitude check — taxiing planes sit near field elevation, so anything
    # above ~2000ft barometric is safely airborne
    if latest["alt_baro"] > 2000:
        return {"status": "departed", "airborne": True}
    return {"status": "taxiing", "airborne": False}
```

This approach lets you catch a common failure mode: the official API says "departed" but the plane is actually still taxiing. That 20-minute gap matters to people waiting at the arrival gate.
Step 3: Cache Smart, Not Hard
The instinct is to cache aggressively because you're hitting rate-limited APIs. But stale cache is exactly how you end up showing "on time" for a delayed flight.
Here's what actually works:
```python
import time


class AdaptiveTTLCache:
    """Cache that shortens TTL when delays are detected."""

    def __init__(self, default_ttl=300):  # 5 min default
        self.store = {}
        self.default_ttl = default_ttl

    def get(self, key: str):
        if key not in self.store:
            return None
        entry = self.store[key]
        if time.time() > entry["expires_at"]:
            del self.store[key]
            return None
        return entry["value"]

    def set(self, key: str, value: dict):
        # if there's an active delay, cache for much less time —
        # delays change fast, a 30min delay can become 2hrs quickly
        if value.get("delay_minutes", 0) > 0:
            ttl = 60   # 1 minute when delays are active
        elif value.get("confidence", 1.0) < 0.7:
            ttl = 90   # sources disagree, check again soon
        else:
            ttl = self.default_ttl
        self.store[key] = {
            "value": value,
            "expires_at": time.time() + ttl,
        }
```

The idea is simple: when things are normal, cache longer. When things are disrupted, cache shorter. This keeps your API costs reasonable on calm days while staying responsive during weather events when accuracy matters most.
Step 4: Show Your Uncertainty
This is the one most developers skip. If your confidence score is low — say, your sources disagree or you're only getting data from one feed — tell the user.
Don't show a confident green "On Time" badge when you're actually guessing. A simple "Last updated 12 min ago — status may have changed" goes a long way. Users can handle uncertainty. What they can't handle is false confidence.
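One way to wire that into the frontend is a single function that downgrades the badge text whenever the data is stale or the sources disagree. This is a hypothetical helper — the name, the 10-minute staleness cutoff, and the 0.7 confidence floor are all illustrative choices, not anything from a real API:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def status_badge(status: str, confidence: float,
                 last_updated: datetime,
                 now: Optional[datetime] = None) -> str:
    """Render a status label that admits uncertainty when warranted."""
    now = now or datetime.now(timezone.utc)
    age_min = int((now - last_updated).total_seconds() // 60)
    if age_min >= 10:
        # stale data: say so, rather than implying it's current
        return f"{status} (last updated {age_min} min ago, may have changed)"
    if confidence < 0.7:
        # sources disagree: show the status but flag it as unconfirmed
        return f"{status} (unconfirmed, sources disagree)"
    return status
```

The point isn't the exact thresholds; it's that the uncertainty your aggregator already computes should survive all the way to the pixels the user sees.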
Prevention: Monitor the Monitors
Set up alerts for when your data sources go stale:
- Track the timestamp of the last successful fetch per source
- Alert if any source hasn't returned fresh data in 2x its normal interval
- Log the disagreement rate between sources — if it spikes, something is wrong
- Monitor your cache hit rate during known disruption events (storms, ATC issues) to make sure your adaptive TTL is actually kicking in
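The first two bullets can be sketched as a small watchdog. Class and method names here are my own; the only rule it encodes is the one above — alert when a source hasn't delivered fresh data in 2x its normal interval:

```python
import time


class StalenessMonitor:
    """Track per-source fetch times and flag sources that go quiet."""

    def __init__(self):
        self.last_fetch = {}       # source name -> unix timestamp
        self.normal_interval = {}  # source name -> expected seconds between fetches

    def record_fetch(self, source: str, interval_s: float, when: float = None):
        # call this on every successful fetch from a source
        self.last_fetch[source] = when if when is not None else time.time()
        self.normal_interval[source] = interval_s

    def stale_sources(self, now: float = None) -> list:
        # a source is stale once 2x its normal interval passes with no data
        now = now if now is not None else time.time()
        return [
            name for name, ts in self.last_fetch.items()
            if now - ts > 2 * self.normal_interval[name]
        ]
```

Run `stale_sources()` on a timer and page yourself on any non-empty result — a quiet feed is indistinguishable from an "everything's on time" feed unless you're watching for it.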
The Bigger Lesson
Real-time data systems are hard not because of the "real-time" part — websockets and streaming are well-solved problems. They're hard because the upstream data is unreliable, inconsistent, and delayed in ways that aren't documented.
The fix is never just "poll faster." It's building a reconciliation layer that treats every data source as potentially wrong, biases toward the worst case for user-facing status, and is honest about its own uncertainty.
That's not just a flight tracking lesson. That's a distributed systems lesson.
