Watching the Artemis II crew stream video from the far side of the Moon this week, I couldn't help but think about latency. NASA engineers had to solve real-time communication across 240,000 miles of void. Meanwhile, I spent last Tuesday debugging why our dashboard's WebSocket connection dies every time a user's laptop goes to sleep.
Same energy, different scale.
If you've built anything with real-time data — chat apps, live dashboards, collaborative editors — you've hit this wall. The connection drops silently, the UI shows stale data, and users start filing bugs that say "it just stopped updating." Let's fix it for good.
The Root Cause: WebSockets Are Fragile by Design
Here's what most tutorials won't tell you: a WebSocket connection can die without either side knowing. TCP keepalives exist, but the default interval on most operating systems is two hours, far too slow to help. In the meantime, your connection is a zombie — technically open, functionally dead.
This happens because of:
- NAT timeout — network devices silently drop idle connections after 30-60 seconds
- Mobile/WiFi transitions — the IP changes, the socket doesn't know
- Laptop sleep/wake cycles — the OS suspends the socket, and it never recovers cleanly
- Load balancer idle timeouts — AWS ALB defaults to 60 seconds, for example
The WebSocket spec has a close event, but it only fires when the connection is shut down gracefully. A dropped network? Radio silence. Your onclose handler may not fire for minutes, if it fires at all.
Step 1: Implement Application-Level Heartbeats
Don't rely on TCP keepalives or the WebSocket protocol's ping/pong frames (which many proxies strip anyway). Roll your own.
```javascript
class ResilientSocket {
  constructor(url) {
    this.url = url;
    this.heartbeatInterval = 25000; // 25 seconds — under most NAT timeouts
    this.heartbeatTimeout = 10000;  // server must respond within 10s
    this.reconnectDelay = 1000;
    this.maxReconnectDelay = 30000;
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => {
      this.reconnectDelay = 1000; // reset backoff on successful connection
      this.startHeartbeat();
    };
    this.ws.onmessage = (event) => {
      if (event.data === 'pong') {
        this.clearHeartbeatTimeout();
        return;
      }
      this.handleMessage(event);
    };
    this.ws.onclose = () => this.reconnect();
    this.ws.onerror = () => this.ws.close(); // trigger onclose -> reconnect
  }

  startHeartbeat() {
    this.stopHeartbeat();
    this.pingTimer = setInterval(() => {
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send('ping');
        // if no pong comes back, connection is dead
        this.pongTimer = setTimeout(() => this.ws.close(), this.heartbeatTimeout);
      }
    }, this.heartbeatInterval);
  }

  clearHeartbeatTimeout() {
    clearTimeout(this.pongTimer);
  }

  stopHeartbeat() {
    clearInterval(this.pingTimer);
    clearTimeout(this.pongTimer);
  }
}
```

The 25-second interval isn't arbitrary. Most NAT devices and load balancers time out idle connections between 30 and 60 seconds. Sending a heartbeat every 25 seconds keeps the connection alive through the NAT and detects dead connections within 35 seconds total.
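That choice is easy to make explicit. Here's a small sketch of the arithmetic (the helper name and the 5-second safety margin are my own, not part of the class above): take the shortest idle timeout anywhere in your path, subtract a margin, and use the result as your heartbeat interval.

```javascript
// Pick a heartbeat interval that stays safely under every idle
// timeout in the path (NAT, load balancer, proxy, CDN).
function safeHeartbeatInterval(idleTimeoutsMs, marginMs = 5000) {
  const shortest = Math.min(...idleTimeoutsMs);
  return shortest - marginMs;
}

// AWS ALB default (60s) plus a typical aggressive NAT (30s):
safeHeartbeatInterval([60000, 30000]); // → 25000
```

Rerun this mental check whenever infrastructure changes: adding a new proxy with a 20-second idle timeout silently breaks a hard-coded 25-second heartbeat.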
Step 2: Add Exponential Backoff with Jitter
When your server goes down, you don't want 10,000 clients reconnecting simultaneously every second. That's a thundering herd, and it will keep your server down.
```javascript
reconnect() {
  this.stopHeartbeat();
  // jitter prevents all clients from reconnecting at the exact same moment
  const jitter = Math.random() * 1000;
  const delay = Math.min(this.reconnectDelay + jitter, this.maxReconnectDelay);
  setTimeout(() => this.connect(), delay);
  // exponential backoff: 1s, 2s, 4s, 8s... capped at 30s
  this.reconnectDelay = Math.min(this.reconnectDelay * 2, this.maxReconnectDelay);
}
```

I've seen production incidents where removing the jitter caused a perfectly healthy server to buckle under reconnection storms. The jitter is not optional. Add it.
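The delay calculation is also worth pulling out into a pure function, which makes it testable. This sketch uses the "full jitter" variant, where the delay is drawn uniformly between zero and the capped exponential ceiling; it spreads clients out even more aggressively than the additive jitter above. The function name and injectable random source are illustrative, not part of the class:

```javascript
// Full-jitter backoff: delay is uniform in [0, min(base * 2^attempt, cap)].
// `rand` is injectable so the logic can be tested deterministically.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000, rand = Math.random) {
  const ceiling = Math.min(baseMs * 2 ** attempt, capMs);
  return rand() * ceiling;
}

// With rand pinned to 1, the ceilings are 1s, 2s, 4s, 8s, 16s, then capped:
[0, 1, 2, 3, 4, 5, 6].map((a) => backoffDelay(a, 1000, 30000, () => 1));
// → [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

Tracking the attempt count instead of mutating a stored delay also makes it trivial to reset on a successful connection: set `attempt` back to zero.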
Step 3: Handle the Visibility API
This is the one most people miss. When a user switches tabs or their laptop sleeps, the browser may suspend timers and network activity. Your heartbeat stops, the connection dies, but your reconnection timer is also frozen.
```javascript
// detect when the tab becomes visible again;
// register this listener in the constructor so the arrow function's
// `this` is the ResilientSocket instance
document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'visible') {
    // connection is likely dead after being backgrounded
    if (this.ws.readyState !== WebSocket.OPEN) {
      this.reconnect();
    } else {
      // force an immediate heartbeat to verify the connection is alive
      this.ws.send('ping');
      this.pongTimer = setTimeout(() => {
        this.ws.close(); // dead connection, trigger reconnect
      }, this.heartbeatTimeout);
    }
  }
});
```

This single addition cut our "stale dashboard" bug reports by about 80%. Users would open their laptop, see the dashboard, and it would already be reconnecting before they noticed anything was wrong.
Step 4: Track State and Resync
Reconnecting is only half the battle. You also need to catch up on missed data. The simplest approach is to track the last event ID or timestamp you received.
```javascript
handleMessage(event) {
  const data = JSON.parse(event.data);
  this.lastEventId = data.id; // track the last event we processed
  this.onMessage(data);
}

connect() {
  // include last event ID in the connection URL so the server can replay missed events
  const url = this.lastEventId
    ? `${this.url}?after=${this.lastEventId}`
    : this.url;
  this.ws = new WebSocket(url);
  // ... rest of setup
}
```

This requires server-side support — you need to buffer recent events and replay them on reconnection. Redis Streams work well for this. If you're using something like PostgreSQL, LISTEN/NOTIFY combined with a short-lived event buffer table does the job.
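Whatever the storage, the server-side contract is small: append events with monotonically increasing IDs, and return everything after a given ID. A minimal in-memory sketch of that contract (the EventBuffer name, maxSize, and event shape are my own, not any library's API):

```javascript
// In-memory replay buffer: keeps the last maxSize events so a
// reconnecting client can catch up from its last-seen event ID.
class EventBuffer {
  constructor(maxSize = 1000) {
    this.maxSize = maxSize;
    this.events = []; // each entry: { id, payload }
    this.nextId = 1;
  }

  append(payload) {
    const event = { id: this.nextId++, payload };
    this.events.push(event);
    // evict the oldest event once the buffer is full
    if (this.events.length > this.maxSize) this.events.shift();
    return event;
  }

  // Everything the client missed after `lastEventId` (the ?after= param).
  since(lastEventId) {
    return this.events.filter((e) => e.id > lastEventId);
  }
}
```

On reconnection, the server parses the `?after=` parameter, streams `since(lastEventId)` to the client, then resumes live delivery. One edge case to handle explicitly: if the requested ID has already been evicted from the buffer, the client is too far behind to catch up incrementally and should do a full resync instead.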
Prevention Checklist
Before you ship your next WebSocket feature, run through this:
- Heartbeat interval under your shortest timeout — check your load balancer, CDN, and proxy configs for idle timeout values
- Exponential backoff with jitter on every reconnection path
- Visibility change handler to detect sleep/wake and tab switches
- Last-event-ID tracking so reconnections don't mean missed data
- Connection state in the UI — show users when they're disconnected instead of displaying stale data as if it's current
- Server-side ping/pong logging — track which clients are dropping and how often, so you can spot network-level issues
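The last checklist item can start very small: record the timestamp of each client's most recent ping, and periodically sweep for clients that have gone quiet. A sketch under assumed names (HeartbeatTracker, the 60-second staleness threshold, and string client IDs are all placeholders to adapt):

```javascript
// Track the last heartbeat per client so silent drops show up in logs.
class HeartbeatTracker {
  constructor(staleAfterMs = 60000) {
    this.staleAfterMs = staleAfterMs;
    this.lastSeen = new Map(); // clientId -> timestamp in ms
  }

  recordPing(clientId, now = Date.now()) {
    this.lastSeen.set(clientId, now);
  }

  // Clients whose last ping is older than the staleness window.
  staleClients(now = Date.now()) {
    return [...this.lastSeen]
      .filter(([, ts]) => now - ts > this.staleAfterMs)
      .map(([clientId]) => clientId);
  }

  remove(clientId) {
    this.lastSeen.delete(clientId);
  }
}
```

Run `staleClients()` on a timer, log the results with whatever client metadata you have (user agent, region, network type), and patterns emerge quickly: if drops cluster on one ISP or one corporate proxy, that's a network-level problem, not a bug in your code.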
The Bigger Picture
NASA's Deep Space Network solves reliability across light-second delays with store-and-forward protocols and redundant communication paths. We don't need that level of engineering for a dashboard, but the principle is the same: assume the connection will fail and design for recovery, not prevention.
Every WebSocket tutorial shows you new WebSocket(url) and calls it a day. Real production WebSocket code is 80% reconnection logic and 20% actual message handling. Once you accept that, the code gets simpler — you stop trying to prevent disconnections and start making them invisible to users.
The full pattern above is roughly 100 lines of JavaScript with no dependencies. I've shipped it in four different projects now, and it handles everything from flaky coffee-shop WiFi to AWS availability zone failovers. Nothing fancy — just the boring stuff that actually matters.
