Why your LLM SSE stream dies after 60 seconds (and how to actually fix it)

The problem nobody warned you about

So you wired up token streaming for your LLM-powered app. Tokens flow nicely in dev. You ship it. Within a day, users start reporting that long generations get cut off — sometimes mid-sentence, sometimes mid-code-block — and the connection just... dies. No error, no nothing. The frontend just stops receiving events.

I hit this exact problem on three different projects last year. Each time, I blamed the model provider. Each time, I was wrong.

What's actually happening

Server-Sent Events (SSE) over HTTP/1.1 keep a long-lived connection open. Your app server happily streams chunks as the model generates them. The problem is everything sitting between your app and the user's browser.

A typical request path looks like this: Browser → CDN → Load Balancer → Reverse Proxy → App Server → upstream model API. Every hop has its own idle-timeout setting. If any of them sees no activity for N seconds, it kills the connection.

Default idle timeouts I've actually seen bite people in production:

nginx proxy_read_timeout: 60 seconds
AWS ALB idle timeout: 60 seconds
Cloudflare free tier: 100 seconds
Heroku router: 30 seconds for the first byte, 55 between bytes

Now think about a generation that takes 90 seconds. If the model pauses for 65 seconds while doing tool use or extended reasoning, the proxy assumes the connection is dead and tears it down. Your app server keeps writing into the void.
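To make the failure condition concrete: what kills the stream is the longest silent gap, not the total duration. Here is a minimal sketch using the defaults listed above. The function name and data shape are made up for illustration.

```javascript
// Hypothetical helper: given the idle timeout of each hop (in seconds) and the
// silent gaps in a stream, report which hop (if any) would cut the connection.
const hops = [
  { name: 'nginx proxy_read_timeout', idleTimeoutSec: 60 },
  { name: 'AWS ALB idle timeout', idleTimeoutSec: 60 },
  { name: 'Cloudflare free tier', idleTimeoutSec: 100 },
];

function firstHopToGiveUp(gapsSec, chain) {
  const longestGap = Math.max(0, ...gapsSec);
  // The hop with the shortest timeout fires first, so check in that order.
  const sorted = [...chain].sort((a, b) => a.idleTimeoutSec - b.idleTimeoutSec);
  const culprit = sorted.find((hop) => longestGap >= hop.idleTimeoutSec);
  return culprit ? culprit.name : null;
}

// A 90-second generation with one 65-second pause for tool use.
console.log(firstHopToGiveUp([5, 65, 20], hops));
// The same 90 seconds with something on the wire every 15 seconds.
console.log(firstHopToGiveUp([15, 15, 15, 15, 15, 15], hops));
```

The first call names a 60-second hop and the second returns null, even though both streams last 90 seconds.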

Reproducing it locally

The annoying part is this never happens in dev because you're going browser → localhost. No proxies. To reproduce it, throw an nginx in front with an aggressive timeout:

nginx

# nginx.conf: reproduce the production failure mode locally
events {}
http {
    server {
        listen 8080;                      # arbitrary local port; browse here, not :3000
        location /api/stream {
            proxy_pass http://localhost:3000;
            proxy_read_timeout 10s;       # aggressively short to trigger the bug fast
            proxy_buffering off;          # we'll come back to this one
        }
    }
}

Now any stream with a 10+ second gap dies. Predictable. Debuggable.
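You also need an upstream that reliably goes quiet. Here is a sketch of a stand-in for the model call: an async generator that emits a few tokens, stalls, then finishes. The name fakeSlowModel and its parameters are assumptions, not a real provider API. Swap it in wherever your handler iterates over the model stream.

```javascript
// Hypothetical stand-in for the real model call: yields a few tokens, goes
// silent for `stallMs`, then yields the rest.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function* fakeSlowModel({ stallMs = 15000, tokenDelayMs = 50 } = {}) {
  for (const token of ['Thinking', ' about', ' it']) {
    await sleep(tokenDelayMs);
    yield { token };
  }
  // Silent for longer than the proxy's read timeout, so the proxy gives up here.
  await sleep(stallMs);
  for (const token of [' ...', ' done', '.']) {
    await sleep(tokenDelayMs);
    yield { token };
  }
}
```

With the 10-second timeout above, the default 15-second stall should cut the stream when you go through nginx and complete when you hit the app server directly.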

The fix, step by step

There are three things to do, and you need all three.

1. Send heartbeats from the server

The simplest fix: emit a no-op every ~15 seconds so the connection never looks idle. SSE comment lines (starting with :) are perfect — the spec says clients must ignore them.

javascript

// Node.js / Express
app.get('/api/stream', async (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no',  // disable nginx buffering per-response
  });

  // SSE comments are silently ignored by EventSource, so they make a good heartbeat
  const heartbeat = setInterval(() => {
    res.write(': ping\n\n');
  }, 15000);

  let eventId = 0;
  try {
    // EventSource can only issue GETs, so the prompt arrives in the query string
    for await (const chunk of callModel(req.query.prompt)) {
      // id: lets the client report its position when it reconnects (see step 3)
      res.write(`id: ${++eventId}\ndata: ${JSON.stringify(chunk)}\n\n`);
    }
    // Named terminal event so clients can tell "finished" from "cut off"
    res.write('event: done\ndata: {}\n\n');
  } finally {
    clearInterval(heartbeat); // critical: don't leak the interval on error
    res.end();
  }
});

Two non-obvious bits in there:

X-Accel-Buffering: no tells nginx not to buffer this specific response. Without it, nginx may hold your tokens until it accumulates a full buffer, which destroys the streaming UX even when the connection survives.
The finally block matters. If the model iterator throws, you'll leak the interval forever otherwise. I've shipped this bug. Don't be me.
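The handler above builds each frame with a template string, which is fine for single-line JSON. If you'd rather centralize the framing (ids, named events, multi-line payloads), a small formatter helps. This is a sketch: formatSseEvent and formatSseComment are hypothetical helper names, not part of Express or any library.

```javascript
// Hypothetical helpers for building SSE frames by hand.
// The field names (id, event, data) and the blank-line terminator come from
// the SSE wire format; everything else here is illustrative.
function formatSseEvent({ id, event, data }) {
  const lines = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  // A payload containing newlines must be sent as one data: line per line.
  for (const line of String(data).split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  return lines.join('\n') + '\n\n';
}

function formatSseComment(text = 'ping') {
  // Lines starting with ':' are comments; clients discard them.
  return `: ${text}\n\n`;
}

console.log(JSON.stringify(formatSseEvent({ id: 7, data: JSON.stringify({ token: 'hi' }) })));
console.log(JSON.stringify(formatSseComment()));
```

In the handler, res.write(formatSseComment()) would stand in for the heartbeat write and res.write(formatSseEvent({ data: JSON.stringify(chunk) })) for the token write.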

2. Disable proxy buffering on streaming routes

Heartbeats only help if the proxy actually forwards them in real time. nginx buffers by default, which means your heartbeat sits in a buffer until enough data accumulates.

nginx

location /api/stream {
    proxy_pass http://localhost:3000;
    proxy_buffering off;       # forward bytes as they arrive
    proxy_cache off;           # don't cache streams, ever
    proxy_read_timeout 600s;   # generous timeout as defense in depth
    chunked_transfer_encoding on;
}

If you're on a managed platform, dig into the docs for how to extend idle timeouts. AWS ALB lets you bump it to 4000 seconds. Cloudflare's free tier caps you around 100 seconds, which is a real constraint to plan around — not something you can config your way out of.
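When a limit is fixed, the heartbeat interval has to fit under it. One way to derive the interval from the whole chain follows the "less than half the shortest timeout" rule of thumb from the checklist further down. The function name, the safety factor, and the 15-second cap are assumptions to adjust.

```javascript
// Hypothetical helper: derive a heartbeat interval from the idle timeouts
// (in seconds) of every hop between the app server and the browser.
function heartbeatIntervalMs(idleTimeoutsSec, { safetyFactor = 0.5, maxMs = 15000 } = {}) {
  if (idleTimeoutsSec.length === 0) return maxMs;
  const shortestSec = Math.min(...idleTimeoutsSec);
  // Stay strictly under safetyFactor * shortest timeout, capped at maxMs.
  const budgetMs = Math.floor(shortestSec * 1000 * safetyFactor) - 1;
  return Math.max(1, Math.min(maxMs, budgetMs));
}

// nginx raised to 600s, ALB raised to 4000s, Cloudflare fixed at 100s.
console.log(heartbeatIntervalMs([600, 4000, 100]));
// The 10-second local repro config needs a much faster heartbeat.
console.log(heartbeatIntervalMs([10]));
```

For the first chain the 15-second cap wins. For the 10-second repro config the interval drops to just under 5 seconds.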

3. Reconnect on the client

Even with all of the above, networks fail. Mobile users tunnel through bad cell. Laptops sleep mid-generation. Your client should resume cleanly.

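The native EventSource reconnects automatically and, if the server tagged its events with id: lines, sends the last one it saw in a Last-Event-ID request header. That header does not survive closing the EventSource and creating a new one, though, which is what you end up doing as soon as you want control over retry timing. In that case, track the last event ID yourself and pass it along explicitly: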

javascript

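// client.js
let lastEventId = null;

function connect(prompt) {
  const url = `/api/stream?prompt=${encodeURIComponent(prompt)}` +
              (lastEventId ? `&resume=${encodeURIComponent(lastEventId)}` : '');
  const source = new EventSource(url);

  source.onmessage = (e) => {
    lastEventId = e.lastEventId;   // empty unless the server sends id: lines
    appendToken(JSON.parse(e.data));
  };

  // Assumes the server marks a finished generation with a named "done" event.
  // Without it, a normal end of stream looks like an error and triggers a retry.
  source.addEventListener('done', () => {
    lastEventId = null;            // the next prompt starts from scratch
    source.close();
  });

  source.onerror = () => {
    source.close();
    // EventSource's built-in retry is opaque; rolling our own is clearer
    setTimeout(() => connect(prompt), 1000);
  };
}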
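The fixed one-second retry above is fine for a demo, but when the server is actually down, every open tab retries in lockstep. Here is a sketch of bounded exponential backoff with jitter. The function name, base delay, and cap are assumptions to tune.

```javascript
// Hypothetical helper: exponential backoff with jitter, capped at maxMs.
// `attempt` is 0 for the first retry; `random` is injectable for testing.
function backoffDelayMs(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Wait at least half the ceiling, plus a random share of the other half,
  // so clients spread out without ever retrying immediately.
  const half = ceiling / 2;
  return Math.floor(half + random() * half);
}

for (let attempt = 0; attempt < 6; attempt++) {
  console.log(attempt, backoffDelayMs(attempt));
}
```

In the onerror handler this would replace the fixed delay, for example setTimeout(() => connect(prompt), backoffDelayMs(attempt++)), with attempt reset to 0 whenever a message arrives.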

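On the server, when resume is present, you need to either replay the missed events from a cache or regenerate from the original prompt and skip what was already emitted. For most apps, "regenerate from the original prompt and skip emitted tokens" is good enough, provided the regeneration is reproducible (for example temperature 0 or a fixed seed, where your provider supports it). With sampled output the new run diverges from what the user already saw, and replaying from a cache is the safer choice. Full deterministic resumption is a much bigger project than it looks.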
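The "replay from a cache" option can be sketched as an in-memory buffer per generation. This assumes a single server process and integer event ids assigned in increasing order from 1. ReplayBuffer and its methods are hypothetical names, and a multi-instance deployment would need shared storage instead.

```javascript
// Hypothetical in-memory replay buffer for one generation.
// Assumes event ids are integers assigned in increasing order starting at 1.
class ReplayBuffer {
  constructor() {
    this.events = []; // events[i] holds the chunk that was sent with id i + 1
    this.finished = false;
  }

  // Record a chunk as it is emitted and return the id to send with it.
  append(chunk) {
    this.events.push(chunk);
    return this.events.length;
  }

  markFinished() {
    this.finished = true;
  }

  // Everything the client has not seen yet, given the last id it received.
  // A missing or malformed id replays from the start.
  eventsAfter(lastEventId) {
    const parsed = Number.parseInt(lastEventId, 10);
    const start = Number.isInteger(parsed) && parsed > 0 ? parsed : 0;
    return this.events.slice(start).map((chunk, i) => ({ id: start + i + 1, chunk }));
  }
}

const buffer = new ReplayBuffer();
['Hello', ',', ' world'].forEach((token) => buffer.append({ token }));
console.log(buffer.eventsAfter('1')); // the client saw event 1, so this replays 2 and 3
```

On a reconnect the handler would write out eventsAfter(req.query.resume) first, then continue with live chunks if the generation is still running.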

Prevention checklist

After being burned by this enough times, here's my standard list for any new streaming endpoint:

Heartbeat interval less than half the shortest proxy timeout in the chain
X-Accel-Buffering: no on every streaming response
proxy_buffering off in nginx for streaming routes
Idle timeouts at every layer you control raised to at least 2x your worst-case stream duration, so a missed heartbeat isn't fatal
A synthetic monitor that hits the streaming endpoint with a known-slow prompt and alerts if the stream ends before the full response arrives
Reconnection logic on the client with bounded backoff

The synthetic monitor is the one most teams skip and the one I now consider non-negotiable. The whole class of bug is silent — no 500s, no error logs, just confused users. You need an external prober to catch it.
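The pass/fail logic of such a prober can be small. Here is a sketch that takes the raw text captured from a streaming response, ignores heartbeats, counts data events, and checks for a terminal event. It assumes a healthy stream ends with an event: done frame, which is a convention your server has to implement, not something SSE gives you. All names are hypothetical.

```javascript
// Hypothetical probe check: given the raw text captured from a streaming
// response, decide whether the stream completed or was cut off.
// Assumes a healthy stream ends with an `event: done` frame.
function evaluateProbe(rawText, { minDataEvents = 1 } = {}) {
  const frames = rawText.split(/\r?\n\r?\n/).filter((frame) => frame.trim() !== '');
  let dataEvents = 0;
  let sawDone = false;
  for (const frame of frames) {
    // Comment lines (heartbeats) start with ':' and carry no information.
    const lines = frame.split(/\r?\n/).filter((line) => !line.startsWith(':'));
    if (lines.length === 0) continue;
    const eventLine = lines.find((line) => line.startsWith('event:'));
    const eventName = eventLine ? eventLine.slice('event:'.length).trim() : 'message';
    if (eventName === 'done') sawDone = true;
    else if (lines.some((line) => line.startsWith('data:'))) dataEvents += 1;
  }
  if (!sawDone) return { ok: false, reason: 'stream ended without a done event' };
  if (dataEvents < minDataEvents) return { ok: false, reason: `only ${dataEvents} data events` };
  return { ok: true, reason: 'completed' };
}
```

The prober would request the endpoint with a prompt known to produce long pauses, collect the body, and alert whenever ok is false.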

Why this is worth getting right

Streaming UX is one of the biggest perceived-performance wins for model-powered apps. Time-to-first-token is what users feel. When the stream silently dies at the 90% mark, they see a half-finished answer with no way to retry, and they click away.

The fix isn't glamorous and it's mostly config. But spending an afternoon getting the proxy and heartbeat story right will save you from a long tail of "is the AI broken?" support tickets that are actually "is your reverse proxy broken."