Resilience Patterns

Retries with exponential backoff and jitter, timeouts, the circuit breaker state machine, and bulkheads — the patterns that keep one failure from becoming an outage.


What you'll learn

Apply timeouts and capped backoff-with-jitter to remote calls
Implement a circuit breaker and explain its three states
Use bulkheads to isolate failures from spreading

In a distributed system, failure is the normal case, not the exception. A dependency will be slow, drop a connection, or return a 503 — not occasionally, but constantly, somewhere in the fleet, right now. Resilience patterns are the small, composable defenses that keep one sick dependency from cascading into a full outage. Four of them carry most of the weight: timeouts, retries, circuit breakers, and bulkheads.

Timeouts: never wait forever

The most important and most forgotten pattern. A call with no timeout will, on a bad day, hang until the connection itself gives up — which might be 30, 60, even 120 seconds. Meanwhile that request holds a connection, a worker, and memory. Enough hung calls and you’ve exhausted the pool, and now a slow dependency has become a down service.
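
A rough way to see how fast this happens is Little's law: calls in flight ≈ arrival rate × how long each call is held. The sketch below uses made-up numbers (50 requests per second, a pool of 100 connections), and `inFlight` and `secondsUntilExhausted` are hypothetical helper names, not part of any library.

```javascript
// Little's law: average in-flight calls = arrival rate × how long each call is held.
function inFlight(ratePerSec, holdMs) {
  return (ratePerSec * holdMs) / 1000;
}

// If every call hangs, the pool drains at the arrival rate.
function secondsUntilExhausted(poolSize, ratePerSec) {
  return poolSize / ratePerSec;
}

const rate = 50;      // assumed: requests per second sent to the slow dependency
const poolSize = 100; // assumed: connections available to this service

console.log(inFlight(rate, 100));                   // healthy 100 ms calls -> 5 in flight
console.log(inFlight(rate, 60_000));                // calls hanging 60 s -> 3000 in flight
console.log(secondsUntilExhausted(poolSize, rate)); // pool is empty after 2 seconds
```

With these numbers a healthy dependency needs 5 connections, while a hung one would need 3000, and the pool of 100 is gone in two seconds.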

Every remote call needs a deadline, and the deadline should be tight — a few seconds at most for a synchronous request path. A timeout that’s too long is almost as bad as no timeout at all.

Retries: with backoff and jitter, never naively

Transient failures — a dropped packet, a momentary 503, a leader election in progress — often succeed on a second try. So retrying is good. Retrying wrong is catastrophic. Two rules:

Back off exponentially. Don’t retry immediately; wait base × 2^attempt, capped at some ceiling. Hammering a struggling service with instant retries is how you turn a blip into an outage.
Add jitter. If a thousand clients all fail at the same instant and all back off by exactly the same amount, they retry in sync — a self-inflicted thundering herd. Randomizing each delay spreads them out.

Exponential backoff with full jitter script.js

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function withRetry(fn, { retries = 4, base = 100, cap = 2000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries || !isRetryable(err)) throw err;
      const backoff = Math.min(cap, base * 2 ** attempt);
      const delay = Math.random() * backoff; // full jitter: 0..backoff
      await sleep(delay);
    }
  }
}

// Only retry transient failures — never a 400 or a validation error.
const isRetryable = (err) =>
  err.code === 'ECONNRESET' || [429, 502, 503, 504].includes(err.status);
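
To make the schedule concrete, here is a small sketch that lists the delay ceiling before each retry, using the same defaults as `withRetry` above. `backoffCeilings` and `fullJitter` are hypothetical helpers written for this illustration.

```javascript
// Upper bound of the delay before each retry: min(cap, base * 2^attempt).
function backoffCeilings({ retries = 4, base = 100, cap = 2000 } = {}) {
  return Array.from({ length: retries }, (_, attempt) =>
    Math.min(cap, base * 2 ** attempt));
}

// Full jitter picks uniformly in [0, ceiling), so the average delay is half the ceiling.
function fullJitter(ceiling, random = Math.random) {
  return random() * ceiling;
}

console.log(backoffCeilings());               // [100, 200, 400, 800]
console.log(backoffCeilings({ retries: 6 })); // [100, 200, 400, 800, 1600, 2000]
console.log(fullJitter(800, () => 0.5));      // 400 (random source pinned for the demo)
```

With the defaults, a call that fails every time waits at most 1500 ms in total across its four retries, and about half that on average. The cap only starts to matter from the sixth retry onward.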


The circuit breaker

Retries help with transient failures. But when a dependency is hard down, retrying just piles load onto something that can’t recover. The circuit breaker detects sustained failure and stops calling the dependency entirely for a while — failing fast instead of failing slow.

It’s a small state machine with three states:

[Figure: Resilience Patterns architecture diagram]

Closed — normal. Calls pass through; failures are counted.
Open — the breaker tripped. Calls fail instantly without touching the dependency, giving it room to recover. After a cooldown, move to half-open.
Half-open — let one trial call through. If it succeeds, close the breaker (recovered). If it fails, open again and wait another cooldown.
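
The three states and their transitions can also be written down as a plain lookup table, which is handy for checking your understanding before reading a full implementation. This is only a sketch: the event names (`thresholdReached`, `cooldownElapsed`, `trialSucceeded`, `trialFailed`) are invented for the illustration.

```javascript
// Each state maps the events it reacts to onto the next state.
const TRANSITIONS = {
  'closed':    { thresholdReached: 'open' },
  'open':      { cooldownElapsed: 'half-open' },
  'half-open': { trialSucceeded: 'closed', trialFailed: 'open' },
};

// Events a state does not list leave it unchanged.
function nextState(state, event) {
  return TRANSITIONS[state]?.[event] ?? state;
}

// One cycle: trip, fail the first trial, then recover on the second.
const events = ['thresholdReached', 'cooldownElapsed', 'trialFailed',
                'cooldownElapsed', 'trialSucceeded'];
let state = 'closed';
for (const event of events) {
  state = nextState(state, event);
  console.log(`${event} -> ${state}`);
}
```

Note that there is no direct path from open to closed: recovery always goes through a half-open trial.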

A tiny circuit breaker script.js

class CircuitBreaker {
  constructor(fn, { threshold = 5, cooldownMs = 10_000 } = {}) {
    this.fn = fn;
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = 'closed';
    this.openedAt = 0;
    this.probing = false;              // true while the single half-open trial runs
  }

  async call(...args) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open'); // fail fast — don't even try
      }
      this.state = 'half-open';          // cooldown done: allow ONE trial
    }

    // Half-open admits a single probe; everyone else fails fast until it resolves.
    const isProbe = this.state === 'half-open';
    if (isProbe) {
      if (this.probing) throw new Error('circuit half-open');
      this.probing = true;
    }

    try {
      const result = await this.fn(...args);
      this.failures = 0;
      this.state = 'closed';             // success resets the breaker
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.threshold) {
        this.state = 'open';             // trip
        this.openedAt = Date.now();
        this.failures = 0;               // start the next cycle clean
      }
      throw err;
    } finally {
      // Only the probe releases the slot; a slow call that started while the
      // breaker was still closed must not free it early.
      if (isProbe) this.probing = false;
    }
  }
}


The payoff is twofold: the failing dependency gets breathing room, and your service stops tying up workers and connections on calls that are doomed anyway. Failing fast is a feature.

Bulkheads: isolate the blast radius

Named after a ship’s watertight compartments — flood one, the ship still floats. In software, a bulkhead limits how much of a shared resource any one dependency can consume, so a failure in one can’t starve the others.

The classic example: a single connection pool shared across all downstream calls. If the payments service hangs, every slow call to it grabs and holds connections until the pool is empty — and now calls to the healthy search service can’t get a connection either. The fix is separate, bounded pools (or concurrency limits) per dependency:

Pattern	Without it	With it
Per-dependency pools	One slow dependency drains the shared pool	Capped per dependency, so the others are unaffected
Bounded concurrency	Unlimited in-flight calls pile up	Excess requests rejected fast, not queued forever
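
A bounded-concurrency bulkhead can be sketched in a few lines. This is a minimal illustration, not a production limiter: the `Bulkhead` class, the dependency names, and the limit of 2 are all assumptions made for the example.

```javascript
// A bulkhead: at most `limit` calls in flight; anything beyond that is rejected at once.
class Bulkhead {
  constructor(limit) {
    this.limit = limit;
    this.inFlight = 0;
  }

  async run(fn) {
    if (this.inFlight >= this.limit) {
      throw new Error('bulkhead full'); // reject fast rather than queue forever
    }
    this.inFlight++;
    try {
      return await fn();
    } finally {
      this.inFlight--; // free the slot on success or failure
    }
  }
}

async function demo() {
  // One bulkhead per dependency (names and limits are illustrative).
  const bulkheads = { payments: new Bulkhead(2), search: new Bulkhead(2) };

  const hang = () => new Promise(() => {}); // simulates a hung payments call
  bulkheads.payments.run(hang);
  bulkheads.payments.run(hang);

  // payments is saturated, so a third call is rejected immediately...
  const third = await bulkheads.payments.run(hang).catch((e) => e.message);
  // ...while search still has both of its slots.
  const search = await bulkheads.search.run(async () => 'results');
  return { third, search };
}

demo().then(console.log); // { third: 'bulkhead full', search: 'results' }
```

Rejecting instead of queueing is a deliberate choice here: a queue in front of a hung dependency only delays the failure and holds memory while it waits.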

The JavaScript angle

These patterns compose, and in Node you wrap them around any async function. A robust outbound call is timeout → breaker → retry, layered:

A timeout wrapper to layer with the rest script.js

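// `fn` is a function that receives an AbortSignal, so it can cancel its own
// work (for example, pass the signal to fetch) when the deadline hits.
function withTimeout(fn, ms) {
  const ac = new AbortController();
  const timer = setTimeout(() => ac.abort(), ms);
  return Promise.race([
    fn(ac.signal),
    new Promise((_, rej) =>
      ac.signal.addEventListener('abort', () => rej(new Error('timeout')),
        { once: true })),
  ]).finally(() => clearTimeout(timer));
}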

// Production code uses a battle-tested library (e.g. `opossum` for breakers,
// `p-retry` for retries) — but the mechanics are exactly the above.
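
One way to read "timeout → breaker → retry" is from the inside out: each attempt gets its own deadline, the breaker counts every attempt's outcome, and retry sits outermost. The sketch below assumes that reading. The `timeout`, `breaker`, and `retry` helpers are deliberately trimmed stand-ins (no backoff, no cooldown) so the layering stays visible, and the thresholds are arbitrary.

```javascript
// Stand-in: reject if `fn` has not settled within `ms`.
const timeout = (fn, ms) => (...args) =>
  new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timeout')), ms);
    fn(...args).then(resolve, reject).finally(() => clearTimeout(timer));
  });

// Stand-in: open after `threshold` consecutive failures (cooldown omitted).
const breaker = (fn, threshold) => {
  let failures = 0;
  return async (...args) => {
    if (failures >= threshold) throw new Error('circuit open');
    try {
      const result = await fn(...args);
      failures = 0;
      return result;
    } catch (err) {
      failures++;
      throw err;
    }
  };
};

// Stand-in: retry up to `retries` times (backoff omitted).
const retry = (fn, retries) => async (...args) => {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(...args);
    } catch (err) {
      // An open circuit is not transient, so stop retrying immediately.
      if (attempt >= retries || err.message === 'circuit open') throw err;
    }
  }
};

// Innermost first: timeout per attempt, then breaker, then retry.
const resilient = (fn) => retry(breaker(timeout(fn, 50), 3), 2);

async function demo() {
  let calls = 0;
  const flaky = async () => {
    if (++calls < 3) throw new Error('503');
    return 'ok';
  };
  console.log(await resilient(flaky)(), calls); // ok 3

  const down = resilient(async () => { throw new Error('503'); });
  await down().catch(() => {});                      // three failed attempts trip the breaker
  console.log(await down().catch((e) => e.message)); // circuit open
}

demo();
```

Putting retry outside the breaker means an open circuit short-circuits the retry loop too, so retries never pile load onto a dependency the breaker has already given up on.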


These patterns protect you from downstream slowness. The next lesson handles the inverse Node-specific hazard: what happens when you produce data faster than something downstream can consume it — backpressure.