Fault Tolerance

Redundancy and failover, quorum and replication factor, removing single points of failure, and degrading gracefully instead of going dark.

What you'll learn

Use redundancy and failover to survive component loss
Reason about replication factor and quorum
Apply graceful degradation and fallbacks

Fault tolerance is the property that a system keeps working when some of its parts don’t. Not “never fails” — that’s impossible at scale, where something is always broken somewhere. The goal is that individual failures stay invisible to users. The availability lesson gave you the math; this lesson gives you the mechanisms that produce those nines in practice.

The throughline is one idea from the availability lesson: redundancy multiplies uptime, but only for the part you actually duplicate. Everything here is about finding the parts that aren’t redundant and fixing them.

Redundancy and failover

Redundancy means running more than one of something so that losing one doesn’t lose the capability. Failover is the act of switching to the spare when the primary dies. The two go together — redundancy without automatic failover just means you have a spare you have to swap in by hand at 3am.

The patterns differ by how the spare is kept warm:

Active-active: all replicas serve traffic. Losing one just sheds some capacity. Best availability, but every node must accept writes safely, which means resolving conflicts when the same data is written on two nodes at once (harder).
Active-passive: one primary serves; a standby stays in sync, ready to be promoted. Simpler, but failover takes seconds to minutes, and you pay for hardware that sits idle until it is needed.

The hard part of failover is detecting failure correctly. The health checks from the availability lesson are what trigger it — and they must be honest, or the load balancer keeps routing into a dead node.
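
One common way to keep health checks honest is to require several failed probes in a row before declaring a node dead, so a single slow response does not trigger a failover. The sketch below is a minimal illustration under that assumption; `FailureDetector`, `failureThreshold`, `recordProbe`, and `pickTarget` are hypothetical names, not the API of any particular load balancer.

```javascript
// Minimal sketch of a consecutive-failure health tracker.
// All names here are illustrative assumptions, not a real load balancer API.
class FailureDetector {
  constructor(failureThreshold = 3) {
    this.failureThreshold = failureThreshold;
    this.consecutiveFailures = new Map(); // nodeId -> failed probes in a row
  }

  // Record the result of one health probe; a success resets the streak.
  recordProbe(nodeId, ok) {
    const failures = ok ? 0 : (this.consecutiveFailures.get(nodeId) ?? 0) + 1;
    this.consecutiveFailures.set(nodeId, failures);
  }

  // A node counts as healthy until it fails `failureThreshold` probes in a row.
  isHealthy(nodeId) {
    return (this.consecutiveFailures.get(nodeId) ?? 0) < this.failureThreshold;
  }
}

// Active-passive routing: use the primary while healthy, otherwise the standby.
function pickTarget(detector, primary, standby) {
  return detector.isHealthy(primary) ? primary : standby;
}

const detector = new FailureDetector(3);
detector.recordProbe("primary", false);
detector.recordProbe("primary", false);
console.log(pickTarget(detector, "primary", "standby")); // "primary" (only 2 failures so far)
detector.recordProbe("primary", false);
console.log(pickTarget(detector, "primary", "standby")); // "standby" (threshold reached)
```

The threshold is a tradeoff: a higher value avoids failing over on a transient blip, while a lower value shortens the time traffic keeps flowing to a node that is really dead.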

Single points of failure

A single point of failure (SPOF) is any component with no redundancy whose loss takes down the whole system. The discipline of fault tolerance is, at heart, hunting these down. The sneaky thing is that adding redundancy in one place often just moves the SPOF somewhere else:

You add redundancy to…	The SPOF moves to…
App servers (two behind an LB)	The load balancer itself
The load balancer (a pair)	DNS / the single region
The region (multi-region)	The shared primary database
The database (replicas)	The failover coordinator
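
The hunt can be sketched as a simple audit: list every component on the request path with the number of independent instances, and flag anything that has only one. The component names and counts below are made up for illustration, and `findSpofs` is a hypothetical helper.

```javascript
// Minimal sketch: flag components on the request path that have no redundancy.
// The inventory is hypothetical; a real audit would also cover shared
// dependencies (power, network, region) that this flat list ignores.
function findSpofs(components) {
  return components
    .filter((c) => c.instances < 2)
    .map((c) => c.name);
}

const requestPath = [
  { name: "dns", instances: 2 },
  { name: "load-balancer", instances: 1 },
  { name: "app-server", instances: 2 },
  { name: "primary-database", instances: 1 },
];

console.log(findSpofs(requestPath)); // [ 'load-balancer', 'primary-database' ]
```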

Replication factor and quorum

For data, redundancy means replication: keep N copies of each piece of data, so losing a node doesn’t lose the data. N is the replication factor — typically 3.

But replicas raise a question: if you write to some copies and read from others, how do you guarantee a read sees the latest write? The answer is quorum. Define:

N = number of replicas
W = replicas that must acknowledge a write before it’s “committed”
R = replicas you read from and compare

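If W + R > N, every read set overlaps every write set in at least one node, so a read always includes at least one copy with the newest value (the reader identifies it by a version number or timestamp stored with the data). With N = 3, choosing W = 2 and R = 2 gives 2 + 2 = 4 > 3. Because both quorums need only two of the three nodes, you can lose any one node and still serve correct reads and writes.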

Checking a quorum configuration (script.js)

function isStrongQuorum({ N, W, R }) {
  return W + R > N; // overlap guarantees a read sees the latest write
}

// 3 replicas, write to 2, read from 2 → strongly consistent, tolerates 1 loss.
console.log(isStrongQuorum({ N: 3, W: 2, R: 2 })); // true

// Tune for fast writes (W=1): cheap writes, but reads may be stale.
console.log(isStrongQuorum({ N: 3, W: 1, R: 1 })); // false — eventual

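Tuning W and R lets you slide between fast-but-eventual and slow-but-strong: the consistency tradeoff, with knobs. Lower W means faster, more available writes, but unless R rises to keep W + R > N, reads risk staleness. Higher W means slower writes that fail when too few replicas are reachable, but reads stay fresh with a smaller R.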
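
The overlap guarantee only helps if the reader can tell which copy is newest. A common approach is to store a version number with each value and return the highest version among the R responses. The sketch below assumes an in-memory array of replicas and a monotonically increasing `version` supplied by the writer; both are simplifications for illustration, and `quorumWrite` and `quorumRead` are hypothetical helpers. Real systems also have to handle concurrent writers and unreachable nodes.

```javascript
// Minimal in-memory sketch of quorum writes and reads with N = 3.
// Replicas are plain objects; `version` is an assumed monotonically
// increasing counter supplied by the writer.
const N = 3;
const replicas = Array.from({ length: N }, () => ({ value: null, version: 0 }));

// Write to the first W replicas listed in `targets` (indexes into `replicas`).
function quorumWrite(targets, W, value, version) {
  if (targets.length < W) throw new Error("not enough replicas to meet W");
  for (const i of targets.slice(0, W)) {
    replicas[i] = { value, version };
  }
}

// Read from the first R replicas in `targets` and return the newest copy.
function quorumRead(targets, R) {
  if (targets.length < R) throw new Error("not enough replicas to meet R");
  return targets
    .slice(0, R)
    .map((i) => replicas[i])
    .reduce((newest, copy) => (copy.version > newest.version ? copy : newest));
}

quorumWrite([0, 1], 2, "v1", 1); // replica 2 never sees this write

// R = 2: any two replicas include at least one that took the write.
console.log(quorumRead([1, 2], 2).value); // "v1"

// R = 1: reading only the replica that missed the write returns stale data.
console.log(quorumRead([2], 1).value); // null
```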

Graceful degradation

The last and most user-facing layer: when something does fail, degrade instead of dying. A partial response beats an error page nearly every time.

Recommendations service down? Show a generic popular-items list, not a blank page.
Can’t reach the live inventory count? Show “in stock” from a slightly stale cache rather than failing the whole product page.
Search backend overloaded? Disable fuzzy matching, keep exact matching alive.

This is the fallback pattern from the availability lesson, generalized: every non-critical dependency should have a “good enough” answer ready for when it’s unreachable. Decide in advance what each feature degrades to.
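
One way to make “decide in advance” concrete is to pair each non-critical call with its predecided fallback value. The sketch below is a minimal illustration; `withFallback`, `fetchRecommendations`, `renderHomePage`, and the popular-items list are hypothetical names, and a production version would typically also add a timeout and log the failure.

```javascript
// Minimal sketch of the fallback pattern for a non-critical dependency.
// `withFallback` and the service functions are hypothetical names.
async function withFallback(call, fallbackValue) {
  try {
    return { data: await call(), degraded: false };
  } catch (err) {
    // The dependency failed: serve the predecided "good enough" answer.
    return { data: fallbackValue, degraded: true };
  }
}

const POPULAR_ITEMS = ["item-1", "item-2", "item-3"]; // generic list chosen ahead of time

// Stand-in for a recommendations service that is currently down.
async function fetchRecommendations() {
  throw new Error("recommendations service unavailable");
}

async function renderHomePage() {
  const recs = await withFallback(fetchRecommendations, POPULAR_ITEMS);
  return { recommendations: recs.data, degraded: recs.degraded };
}

// Logs the popular-items list with degraded: true instead of failing the page.
renderHomePage().then((page) => console.log(page));
```

Returning a `degraded` flag alongside the data lets the page decide whether to label the content (for example, as “popular items” rather than “recommended for you”) and lets monitoring count how often the fallback is being served.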

Redundancy and failover both assume the survivors can agree on who’s in charge now — which replica is the new primary. That coordination problem is leader election, next.