Assume Every Dependency Will Fail — Then Survive It
Resilience Patterns
Retries with exponential backoff and jitter, timeouts, the circuit breaker state machine, and bulkheads — the patterns that keep one failure from becoming an outage.
What you'll learn
- Apply timeouts and capped backoff-with-jitter to remote calls
- Implement a circuit breaker and explain its three states
- Use bulkheads to isolate failures from spreading
In a distributed system, failure is the normal case, not the exception. A dependency will be slow, drop a connection, or return a 503 — not occasionally, but constantly, somewhere in the fleet, right now. Resilience patterns are the small, composable defenses that keep one sick dependency from cascading into a full outage. Four of them carry most of the weight: timeouts, retries, circuit breakers, and bulkheads.
Timeouts: never wait forever
The most important and most forgotten pattern. A call with no timeout will, on a bad day, hang until the connection itself gives up — which might be 30, 60, even 120 seconds. Meanwhile that request holds a connection, a worker, and memory. Enough hung calls and you’ve exhausted the pool, and now a slow dependency has become a down service.
Every remote call needs a deadline, and the deadline should be tight — a few seconds at most for a synchronous request path. A timeout that’s too long is almost as bad as no timeout at all.
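To put rough numbers on why a long timeout is nearly as bad as none, here is a back-of-the-envelope sketch. The function name and every figure in it are hypothetical, chosen only for illustration. It assumes hung calls arrive at a steady rate and each one holds a pool slot until its timeout fires.

```javascript
// Hypothetical helper: how long until hung calls occupy every slot in a pool?
function secondsUntilPoolExhausted({ poolSize, hungCallsPerSecond, timeoutSeconds }) {
  // Each hung call holds a slot for `timeoutSeconds`, so at steady state the
  // number of held slots is rate × hold time (Little's law).
  const steadyStateHeld = hungCallsPerSecond * timeoutSeconds;
  if (steadyStateHeld < poolSize) return Infinity; // slots free up fast enough
  return poolSize / hungCallsPerSecond;
}

// 100 slots, 20 hung calls per second, 60 s timeout: the pool is gone in 5 s.
console.log(secondsUntilPoolExhausted({ poolSize: 100, hungCallsPerSecond: 20, timeoutSeconds: 60 })); // 5

// Same load with a 2 s timeout: at most 40 slots are held, so the pool survives.
console.log(secondsUntilPoolExhausted({ poolSize: 100, hungCallsPerSecond: 20, timeoutSeconds: 2 })); // Infinity
```

Under these assumptions the dependency is equally slow in both cases; only the deadline decides whether the caller goes down with it.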
Retries: with backoff and jitter, never naively
Transient failures — a dropped packet, a momentary 503, a leader election in progress — often succeed on a second try. So retrying is good. Retrying wrong is catastrophic. Two rules:
- Back off exponentially. Don’t retry immediately; wait `base × 2^attempt`, capped at some ceiling. Hammering a struggling service with instant retries is how you turn a blip into an outage.
- Add jitter. If a thousand clients all fail at the same instant and all back off by exactly the same amount, they retry in sync: a self-inflicted thundering herd. Randomizing each delay spreads them out.
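Before wiring the two rules into a retry loop, it can help to look at the numbers they produce. The sketch below is illustrative only: `backoffCeilings` and `jitteredDelay` are hypothetical helper names, and the defaults (100 ms base, 2000 ms cap) are assumed values, not recommendations.

```javascript
// Upper bound of the wait before each retry: min(cap, base × 2^attempt).
function backoffCeilings({ retries = 4, base = 100, cap = 2000 } = {}) {
  return Array.from({ length: retries }, (_, attempt) =>
    Math.min(cap, base * 2 ** attempt));
}

// Full jitter: a uniformly random delay between 0 and the ceiling.
// `random` is injectable only so the sketch can be checked deterministically.
function jitteredDelay(ceiling, random = Math.random) {
  return random() * ceiling;
}

console.log(backoffCeilings());                          // → [100, 200, 400, 800]
console.log(backoffCeilings({ retries: 6, cap: 1000 })); // → [100, 200, 400, 800, 1000, 1000]
console.log(jitteredDelay(800, () => 0.5));              // → 400
```

The ceilings double until they hit the cap, and jitter then picks a point somewhere below each ceiling, so two clients that failed together are unlikely to retry together.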
```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function withRetry(fn, { retries = 4, base = 100, cap = 2000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries || !isRetryable(err)) throw err;
      const backoff = Math.min(cap, base * 2 ** attempt);
      const delay = Math.random() * backoff; // full jitter: 0..backoff
      await sleep(delay);
    }
  }
}

// Only retry transient failures, never a 400 or a validation error.
const isRetryable = (err) =>
  err.code === 'ECONNRESET' || [429, 502, 503, 504].includes(err.status);
```

The circuit breaker
Retries help with transient failures. But when a dependency is hard down, retrying just piles load onto something that can’t recover. The circuit breaker detects sustained failure and stops calling the dependency entirely for a while — failing fast instead of failing slow.
It’s a small state machine with three states:
- Closed — normal. Calls pass through; failures are counted.
- Open — the breaker tripped. Calls fail instantly without touching the dependency, giving it room to recover. After a cooldown, move to half-open.
- Half-open — let one trial call through. If it succeeds, close the breaker (recovered). If it fails, open again and wait another cooldown.
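The three states and their transitions can be written down as a small lookup table. This is a minimal sketch of the state machine alone, with no timers or failure counting; the event names (`thresholdReached`, `cooldownElapsed`, `success`, `failure`) are hypothetical labels for the conditions described in the list.

```javascript
// Each state maps the events that matter to it onto the next state.
const transitions = {
  closed: { thresholdReached: 'open' },                // too many failures: trip
  open: { cooldownElapsed: 'half-open' },              // cooldown over: allow a trial
  'half-open': { success: 'closed', failure: 'open' }, // the trial call decides
};

// Any event a state does not list leaves the state unchanged.
function nextState(state, event) {
  return transitions[state]?.[event] ?? state;
}

console.log(nextState('closed', 'thresholdReached')); // open
console.log(nextState('open', 'cooldownElapsed'));    // half-open
console.log(nextState('half-open', 'failure'));       // open
console.log(nextState('half-open', 'success'));       // closed
```

Note that a single failure in the closed state is not a transition on its own: failures are only counted there, and the breaker trips once the count reaches the threshold.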
```javascript
class CircuitBreaker {
  constructor(fn, { threshold = 5, cooldownMs = 10_000 } = {}) {
    this.fn = fn;
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = 'closed';
    this.openedAt = 0;
    this.probing = false; // true while the single half-open trial runs
  }

  async call(...args) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open'); // fail fast: don't even try
      }
      this.state = 'half-open'; // cooldown done: allow ONE trial
    }

    // Half-open admits a single probe; everyone else fails fast until it resolves.
    if (this.state === 'half-open') {
      if (this.probing) throw new Error('circuit half-open');
      this.probing = true;
    }

    try {
      const result = await this.fn(...args);
      this.failures = 0;
      this.state = 'closed'; // success resets the breaker
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.threshold) {
        this.state = 'open'; // trip
        this.openedAt = Date.now();
        this.failures = 0; // start the next cycle clean
      }
      throw err;
    } finally {
      this.probing = false; // release the probe slot
    }
  }
}
```

The payoff is twofold: the failing dependency gets breathing room, and your service stops wasting threads waiting on calls that are doomed anyway. Failing fast is a feature.
Bulkheads: isolate the blast radius
Named after a ship’s watertight compartments — flood one, the ship still floats. In software, a bulkhead limits how much of a shared resource any one dependency can consume, so a failure in one can’t starve the others.
The classic example: a single connection pool shared across all downstream calls. If the payments service hangs, every slow call to it grabs and holds connections until the pool is empty — and now calls to the healthy search service can’t get a connection either. The fix is separate, bounded pools (or concurrency limits) per dependency:
| Resource | Without a bulkhead | With a bulkhead |
|---|---|---|
| Connection pool | Shared by every dependency, so one slow dependency drains all connections | Capped per dependency; the others are unaffected |
| In-flight concurrency | Unlimited in-flight calls pile up | Bounded; excess requests are rejected fast, not queued forever |
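A per-dependency concurrency limit can be sketched in a few lines. This is one possible shape, not a definitive implementation: the `Bulkhead` class, its method names, and the limits of 10 and 20 are all assumptions made for illustration. It rejects excess calls immediately instead of queueing them.

```javascript
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  // Claim a slot if one is free; never waits.
  tryAcquire() {
    if (this.inFlight >= this.maxConcurrent) return false;
    this.inFlight++;
    return true;
  }

  release() {
    this.inFlight--;
  }

  // Run `fn` inside the compartment, or reject fast when it is full.
  async run(fn) {
    if (!this.tryAcquire()) throw new Error('bulkhead full');
    try {
      return await fn();
    } finally {
      this.release(); // free the slot whether fn succeeded or failed
    }
  }
}

// One compartment per dependency (hypothetical limits): a hung payments
// service can hold at most 10 slots and cannot touch the search budget.
const bulkheads = {
  payments: new Bulkhead(10),
  search: new Bulkhead(20),
};
```

A call would then go through its own compartment, for example `bulkheads.payments.run(() => chargeCard(order))`, where `chargeCard` stands in for whatever function performs the remote call.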
The JavaScript angle
These patterns compose, and in Node you wrap them around any async function. A robust outbound call is timeout → breaker → retry, layered:
```javascript
// `fn` is a function, not a promise: it receives an AbortSignal so it can
// cancel its own work when the deadline passes (e.g. by passing it to fetch).
function withTimeout(fn, ms) {
  const ac = new AbortController();
  const timer = setTimeout(() => ac.abort(), ms);
  return Promise.race([
    fn(ac.signal),
    new Promise((_, rej) =>
      ac.signal.addEventListener('abort', () => rej(new Error('timeout')))),
  ]).finally(() => clearTimeout(timer));
}

// Production code uses a battle-tested library (e.g. `opossum` for breakers,
// `p-retry` for retries), but the mechanics are exactly the above.
```

These patterns protect you from downstream slowness. The next lesson handles the inverse Node-specific hazard: what happens when you produce data faster than something downstream can consume it, known as backpressure.