The problem: a fetch that never returns
Picture this. It's 11pm. PagerDuty is going off. Your dashboard shows half the requests from your Node service are just... sitting there. No error. No timeout. The connection is open, the spinner is spinning, and somewhere out there a user is staring at it wondering if your app is broken.
I hit this exact issue last month on a project that talks to a third-party pricing API. Everything was fine for weeks. Then one afternoon the upstream got slow, and our entire checkout flow started piling up requests that never resolved. Memory climbed. Eventually pods got OOMKilled.
Here's the kicker: the bug had been there since day one. We just got lucky.
Why fetch hangs
By default, fetch() has no timeout. None. Zero. If the server on the other end opens a connection and then never sends bytes (or the network drops without a clean FIN), your request will sit there until the OS-level socket eventually gives up — which, depending on your kernel settings, can be measured in minutes.
Here's a quick demo. Spin up a server that accepts connections but never responds:
```js
// stall-server.js
import net from "node:net";

const server = net.createServer((socket) => {
  // Accept the connection but never write anything back.
  // The client's fetch() will hang indefinitely.
});

server.listen(3001, () => console.log("stalling on :3001"));
```

Now hit it with plain fetch:
```js
const res = await fetch("http://localhost:3001");
// Spoiler: you never get here
console.log(res.status);
```

That await never resolves. Your event loop is fine — Node is happy to wait forever. But your application code is stuck.
The fix, version 1: AbortController
The standard way to bound a fetch is to pass an AbortSignal. You create a controller, set a timer that calls abort(), and pass the signal to fetch:
```js
async function fetchWithTimeout(url, ms = 5000) {
  const controller = new AbortController();
  const id = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    // Always clear — otherwise you leak timer handles on success
    clearTimeout(id);
  }
}
```

This works. It's been the canonical answer on Stack Overflow for years. But there are two papercuts:
- You have to remember to clear the timer, or you leak setTimeout handles.
- The thrown error is a generic AbortError, so distinguishing "I cancelled this" from "the server timed out" requires extra bookkeeping.
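One common workaround for the second papercut: pass a custom reason to abort(). In runtimes where AbortController.abort() accepts a reason and fetch rejects with it (recent Node and current browsers — check your target versions), you can tag the timeout yourself. The function name here is illustrative, not a standard API:

```js
// Sketch: tag the abort with a TimeoutError so the catch block can tell
// "our timer fired" apart from "the caller cancelled".
// Assumes abort(reason) support (Node 17.2+, current browsers).
async function fetchWithTaggedTimeout(url, ms = 5000) {
  const controller = new AbortController();
  const id = setTimeout(
    () => controller.abort(new DOMException("Request timed out", "TimeoutError")),
    ms,
  );
  try {
    // fetch rejects with the signal's reason, so err.name === "TimeoutError"
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(id);
  }
}
```

You still have to remember the clearTimeout, though — which is why the next version is nicer.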
The fix, version 2: AbortSignal.timeout()
If you're on a modern Node or any current browser, there's a much cleaner version. AbortSignal.timeout(ms) returns a signal that fires automatically. No controller, no clearTimeout, no leaks:
```js
async function fetchWithTimeout(url, ms = 5000) {
  return fetch(url, { signal: AbortSignal.timeout(ms) });
}
```

That's the whole thing. When the timeout fires, fetch rejects with a TimeoutError (a DOMException whose name === "TimeoutError"), which is distinct from a user-initiated AbortError. According to the MDN docs for AbortSignal.timeout, that distinction is part of the spec — and it matters more than you'd think. You almost always want to retry timeouts but not user cancellations.
Catching it cleanly:
```js
try {
  const res = await fetchWithTimeout("https://api.example.com/things", 3000);
  return await res.json();
} catch (err) {
  if (err.name === "TimeoutError") {
    // Server is slow — we can retry, fall back to cache, etc.
    return getFromCache();
  }
  if (err.name === "AbortError") {
    // Someone cancelled this intentionally — don't retry
    throw err;
  }
  throw err; // Some other network failure
}
```

Combining timeouts with retries
A single timeout isn't always enough. Real outages are flaky — the first attempt times out, the second works. So you usually want a small retry loop with exponential backoff:
```js
async function fetchWithRetry(url, { tries = 3, timeout = 3000 } = {}) {
  let lastErr;
  for (let i = 0; i < tries; i++) {
    try {
      return await fetch(url, { signal: AbortSignal.timeout(timeout) });
    } catch (err) {
      lastErr = err;
      // Don't retry user-initiated cancellations
      if (err.name === "AbortError") throw err;
      // No point sleeping after the final attempt
      if (i === tries - 1) break;
      // Exponential backoff with jitter — avoid a thundering herd on recovery
      const delay = Math.min(1000 * 2 ** i, 8000) + Math.random() * 200;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastErr;
}
```

A couple of things that aren't obvious here:
- The jitter matters. Without it, every client in your fleet retries at the same instant, and the recovering upstream gets a synchronized stampede.
- Capping the backoff matters too. Without the Math.min, the waits keep doubling without bound — a few attempts in you're at 8s, then 16s, and by then the user has closed the tab.
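For concreteness, here's the schedule that delay formula produces once the random jitter is stripped out (same 1s base and 8s cap as above; the helper name is just for illustration):

```js
// Deterministic view of the backoff schedule from the retry loop,
// with the jitter term omitted so the numbers are repeatable.
function backoffDelay(attempt, base = 1000, cap = 8000) {
  return Math.min(base * 2 ** attempt, cap);
}

for (let i = 0; i < 6; i++) {
  console.log(`retry ${i + 1}: wait ${backoffDelay(i)}ms`);
}
// retry 1: 1000ms, retry 2: 2000ms, retry 3: 4000ms,
// then the cap kicks in and retries 4-6 all wait 8000ms
```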
Combining multiple signals
One more gotcha. If your request is already tied to a user-cancellation signal (say, a component unmounting in React), you don't want the timeout to overwrite it. The newer helper AbortSignal.any([...]) fires when any of the input signals fire:
```js
function fetchUserAware(url, userSignal, ms = 5000) {
  const signal = AbortSignal.any([
    userSignal,
    AbortSignal.timeout(ms),
  ]);
  return fetch(url, { signal });
}
```

I haven't tested this exhaustively across older runtimes — AbortSignal.any is relatively recent. Check the MDN compatibility table for AbortSignal.any before relying on it.
Prevention tips
A few habits that have saved me a lot of pager fatigue:
- Never call fetch without a timeout in server-side code. Browser-side it's slightly less critical because the user can refresh, but server-to-server calls without a bound are an outage waiting to happen.
- Pick timeouts based on the dependency's actual SLO, not a vibe. If a service promises p99 < 800ms, your timeout should be ~2x that, not 30s.
- Distinguish error types in your retry logic. Retrying a 401 forever is just DoS-ing your auth server.
- Add metrics for timeouts. A sudden spike in TimeoutError count is one of the cleanest leading indicators of an upstream going sideways.
- Test the hang case. Use a stall server like the one above in your integration tests — it's the only way to be sure your timeout actually fires.
The big picture: networks fail in two ways — fast and loud, or slow and quiet. The fast-and-loud failures surface immediately. The slow-and-quiet ones lurk in your code for months and then take down production at the worst possible moment. Bounding every external call is the cheapest insurance you can buy.
