The problem nobody warned you about
So you wired up token streaming for your LLM-powered app. Tokens flow nicely in dev. You ship it. Within a day, users start reporting that long generations get cut off — sometimes mid-sentence, sometimes mid-code-block — and the connection just... dies. No error, no nothing. The frontend just stops receiving events.
I hit this exact problem on three different projects last year. Each time, I blamed the model provider. Each time, I was wrong.
What's actually happening
Server-Sent Events (SSE) over HTTP/1.1 keep a long-lived connection open. Your app server happily streams chunks as the model generates them. The problem is everything sitting between your app and the user's browser.
A typical request path looks like this: Browser → CDN → Load Balancer → Reverse Proxy → App Server → upstream model API. Every hop has its own idle-timeout setting. If any of them sees no activity for N seconds, it kills the connection.
Default idle timeouts I've actually seen bite people in production:
- nginx proxy_read_timeout: 60 seconds
- AWS ALB idle timeout: 60 seconds
- Cloudflare free tier: 100 seconds
- Heroku router: 30 seconds for the first byte, 55 between bytes
Now think about a generation that takes 90 seconds. If the model pauses for 65 seconds while doing tool use or extended reasoning, the proxy assumes the connection is dead and tears it down. Your app server keeps writing into the void.
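To make that arithmetic concrete, here is a minimal sketch that checks a silent gap against each hop's idle timeout. The hop list reuses the defaults from the list above; the function name and data shape are illustrative assumptions, not part of any library.

```javascript
// Idle timeouts (seconds) per hop, taken from the defaults listed above.
// The data shape and function name are illustrative assumptions.
const hops = [
  { name: 'Cloudflare', idleTimeoutSec: 100 },
  { name: 'AWS ALB', idleTimeoutSec: 60 },
  { name: 'nginx proxy_read_timeout', idleTimeoutSec: 60 },
];

// Return the name of every hop whose idle timeout is reached by the gap,
// i.e. every hop that would drop the connection during that silence.
function hopsThatDrop(hopList, silentGapSec) {
  return hopList
    .filter((hop) => silentGapSec >= hop.idleTimeoutSec)
    .map((hop) => hop.name);
}

console.log(hopsThatDrop(hops, 65)); // the 65-second pause from the example
console.log(hopsThatDrop(hops, 15)); // a gap covered by a 15-second heartbeat
```

With these numbers, the 65-second pause trips both 60-second hops while Cloudflare would still be waiting, and a 15-second gap trips nothing.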
Reproducing it locally
The annoying part is this never happens in dev because you're going browser → localhost. No proxies. To reproduce it, throw an nginx in front with an aggressive timeout:
# nginx.conf — reproduce the production failure mode locally
location /api/stream {
    proxy_pass http://localhost:3000;
    proxy_read_timeout 10s;  # aggressively short to trigger the bug fast
    proxy_buffering off;     # we'll come back to this one
}
Now any stream with a 10+ second gap dies. Predictable. Debuggable.
The fix, step by step
There are three things to do, and you need all three.
1. Send heartbeats from the server
The simplest fix: emit a no-op every ~15 seconds so the connection never looks idle. SSE comment lines (starting with :) are perfect — the spec says clients must ignore them.
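// Node.js / Express
app.get('/api/stream', async (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no', // disable nginx buffering per-response
  });
  // SSE comments are silently ignored by EventSource, so they make a good heartbeat
  const heartbeat = setInterval(() => {
    res.write(': ping\n\n');
  }, 15000);
  // Stop heartbeating as soon as the connection goes away
  res.on('close', () => clearInterval(heartbeat));
  try {
    let id = 0;
    // This is a GET route, so the prompt arrives in the query string, not the body
    for await (const chunk of callModel(req.query.prompt)) {
      // The id: field is what populates e.lastEventId on the client
      res.write(`id: ${id++}\ndata: ${JSON.stringify(chunk)}\n\n`);
    }
    // Explicit end marker; a client can use it to tell completion from a drop
    res.write('event: done\ndata: {}\n\n');
  } finally {
    clearInterval(heartbeat); // critical: don't leak the interval on error
    res.end();
  }
});
Two non-obvious bits in there: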
- X-Accel-Buffering: no tells nginx not to buffer this specific response. Without it, nginx may hold your tokens until it accumulates a full buffer, which destroys the streaming UX even when the connection survives.
- The finally block matters. If the model iterator throws, you'll leak the interval forever otherwise. I've shipped this bug. Don't be me.
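If it helps to see the wire format in isolation, here is a small sketch of helpers that build SSE frames as strings. The function names are my own illustrative choices; the field syntax (a leading colon for comments, id:, event:, data:, and a blank line to end the frame) comes from the SSE format itself.

```javascript
// Minimal SSE frame helpers. Function names are illustrative assumptions.

// A comment frame: a line starting with ":" followed by a blank line.
// EventSource ignores it, which is why it works as a heartbeat.
function sseComment(text = 'ping') {
  return `: ${text}\n\n`;
}

// A data frame. Optional id and event fields precede the data line.
// JSON.stringify keeps the payload on one line, so one "data:" field is enough.
function sseEvent(payload, { id, event } = {}) {
  let frame = '';
  if (id !== undefined) frame += `id: ${id}\n`;
  if (event !== undefined) frame += `event: ${event}\n`;
  frame += `data: ${JSON.stringify(payload)}\n\n`;
  return frame;
}

console.log(JSON.stringify(sseComment()));
console.log(JSON.stringify(sseEvent({ token: 'hi' }, { id: 3 })));
```

Building frames in one place makes it harder to forget the trailing blank line, which is what tells the client an event is complete.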
2. Disable proxy buffering on streaming routes
Heartbeats only help if the proxy actually forwards them in real time. nginx buffers by default, which means your heartbeat sits in a buffer until enough data accumulates.
location /api/stream {
    proxy_pass http://localhost:3000;
    proxy_buffering off;         # forward bytes as they arrive
    proxy_cache off;             # don't cache streams, ever
    proxy_read_timeout 600s;     # generous timeout as defense in depth
    chunked_transfer_encoding on;
}
If you're on a managed platform, dig into the docs for how to extend idle timeouts. AWS ALB lets you bump it to 4000 seconds. Cloudflare's free tier caps you around 100 seconds, which is a real constraint to plan around, not something you can config your way out of.
3. Reconnect on the client
Even with all of the above, networks fail. Mobile users tunnel through bad cell. Laptops sleep mid-generation. Your client should resume cleanly.
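The native EventSource reconnects automatically, and it will send the last id: it saw back in a Last-Event-ID header, but only if the server tags its events with id: fields and only on its own built-in retries. The code below closes the source and opens a new one to control the retry delay, which drops that state. So track the last event ID yourself and pass it back explicitly: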
// client.js
let lastEventId = null;
function connect(prompt) {
  const url = `/api/stream?prompt=${encodeURIComponent(prompt)}` +
    (lastEventId ? `&resume=${encodeURIComponent(lastEventId)}` : '');
  const source = new EventSource(url);
  source.onmessage = (e) => {
    lastEventId = e.lastEventId; // remember position (needs id: fields from the server)
    appendToken(JSON.parse(e.data));
  };
  // Assumes the server sends a final "done" event. Without one, a normal end
  // of stream looks like a drop and triggers a reconnect.
  source.addEventListener('done', () => source.close());
  source.onerror = () => {
    source.close();
    // EventSource's built-in retry is opaque; rolling our own is clearer
    setTimeout(() => connect(prompt), 1000);
  };
}
On the server, when resume is present, you need to either replay from a cache or, more practically, record what was already emitted and skip ahead in the new generation. For most apps, "regenerate from the original prompt and skip emitted tokens" is good enough. Full deterministic resumption is a much bigger project than it looks.
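A minimal sketch of the "regenerate and skip" approach, under two assumptions of mine: event IDs are 0-based chunk indexes, and the regenerated stream yields chunks in the same positions as the first attempt. The function name resumeFrom is hypothetical.

```javascript
// Sketch of "regenerate and skip". Assumes event IDs are 0-based chunk
// indexes and that `chunks` is the regenerated stream (any sync or async
// iterable). The function name and ID scheme are illustrative assumptions.
async function* resumeFrom(chunks, lastEventId) {
  // No usable resume ID means a fresh stream: nothing has been seen yet.
  const parsed = Number.parseInt(lastEventId, 10);
  const lastSeen = Number.isNaN(parsed) ? -1 : parsed;

  let index = 0;
  for await (const chunk of chunks) {
    // Skip everything the client already received.
    if (index > lastSeen) yield { id: index, chunk };
    index += 1;
  }
}
```

In the handler, this would wrap the model iterator, with the value of the resume query parameter as the second argument and each yielded id written back as the id: field. Skipping by index only lines up if the regeneration reproduces the same chunks, which sampling does not guarantee, so treat it as the "good enough" option rather than true resumption.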
Prevention checklist
After being burned by this enough times, here's my standard list for any new streaming endpoint:
- Heartbeat interval less than half the shortest proxy timeout in the chain
- X-Accel-Buffering: no on every streaming response
- proxy_buffering off in nginx for streaming routes
- Idle timeouts at every layer raised to at least 2x your worst-case stream duration
- A synthetic monitor that pings the streaming endpoint with a known-slow prompt and alerts if total stream duration drops below expected
- Reconnection logic on the client with bounded backoff
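For the last item, a minimal sketch of bounded backoff: the delay doubles with each consecutive failure and stops growing at a cap. The function name and the 1-second and 30-second defaults are illustrative assumptions, not recommendations.

```javascript
// Exponential backoff with a cap. The name and default values are
// illustrative assumptions.
function backoffDelayMs(attempt, { baseMs = 1000, maxMs = 30000 } = {}) {
  // attempt 0 -> baseMs, attempt 1 -> 2 * baseMs, and so on, never above maxMs.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Delays for the first six consecutive failures.
const delays = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n));
console.log(delays); // [ 1000, 2000, 4000, 8000, 16000, 30000 ]
```

In a reconnect handler, this value would replace a fixed retry delay, with the attempt counter reset to 0 once a message arrives.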
The synthetic monitor is the one most teams skip and the one I now consider non-negotiable. The whole class of bug is silent — no 500s, no error logs, just confused users. You need an external prober to catch it.
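As a rough sketch of the check such a prober might run after each attempt, the function below turns one probe result into a list of alert reasons. Everything here is an assumption for illustration: the field names, the 0.8 tolerance, and the idea that the stream ends with an explicit final event the prober can look for.

```javascript
// Decide whether one probe run should alert. Field names, the 0.8 ratio,
// and the final-event check are illustrative assumptions.
function probeAlerts(run, { expectedDurationMs, minRatio = 0.8 }) {
  const alerts = [];
  // A known-slow prompt that finishes much faster than usual was likely cut off.
  if (run.durationMs < expectedDurationMs * minRatio) {
    alerts.push('stream ended early');
  }
  // Only meaningful if the endpoint emits an explicit end-of-stream event.
  if (!run.sawFinalEvent) {
    alerts.push('no final event received');
  }
  return alerts;
}

console.log(probeAlerts({ durationMs: 60000, sawFinalEvent: false }, { expectedDurationMs: 90000 }));
console.log(probeAlerts({ durationMs: 88000, sawFinalEvent: true }, { expectedDurationMs: 90000 }));
```

The probe itself still has to run from outside your infrastructure so that it crosses the same CDN and proxy hops your users do.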
Why this is worth getting right
Streaming UX is one of the biggest perceived-performance wins for model-powered apps. Time-to-first-token is what users feel. When the stream silently dies at the 90% mark, they see a half-finished answer with no way to retry, and they click away.
The fix isn't glamorous and it's mostly config. But spending an afternoon getting the proxy and heartbeat story right will save you from a long tail of "is the AI broken?" support tickets that are actually "is your reverse proxy broken."
