Availability — SLAs, SLOs, and the Nines

How Much Downtime Is "Four Nines" Anyway?

Availability math — what the nines mean in real downtime, how redundancy multiplies uptime, and the difference between SLI, SLO, and SLA.

8 min read · Level 3/5 · #system-design #availability #reliability
What you'll learn
  • Translate "the nines" into real downtime budgets
  • Distinguish SLI, SLO, and SLA
  • Reason about how serial vs redundant components affect availability

“Highly available” is meaningless until you attach a number. That number is usually expressed in nines (99.9%, 99.99%, 99.999%). Each extra nine cuts the allowed downtime by a factor of ten and, as a rule of thumb, is roughly 10× harder and more expensive to achieve. Knowing what the nines cost in real downtime is the difference between a target you can defend and a number you copied off a slide.

The nines, as actual downtime

| Availability | Name | Downtime/year | Downtime/day |
|---|---|---|---|
| 99% | “two nines” | ~3.65 days | ~14.4 min |
| 99.9% | “three nines” | ~8.76 hours | ~1.44 min |
| 99.99% | “four nines” | ~52.6 min | ~8.6 sec |
| 99.999% | “five nines” | ~5.26 min | ~0.86 sec |

Two nines sounds impressive and is nearly worthless — almost four days down a year. Four nines (~53 minutes/year) is the common target for serious consumer services. Five nines is telecom/infrastructure territory and is very expensive to reach and maintain.

Computing the downtime budget (script.js)
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeBudget(availabilityPercent) {
  const allowedFraction = 1 - availabilityPercent / 100;
  const minutes = MINUTES_PER_YEAR * allowedFraction;
  return `${minutes.toFixed(1)} min/year`;
}

console.log(downtimeBudget(99.9));   // "525.6 min/year"  (~8.76 h)
console.log(downtimeBudget(99.99));  // "52.6 min/year"
console.log(downtimeBudget(99.999)); // "5.3 min/year"
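The table's per-day column can be reproduced the same way. The following is a minimal sketch that extends the idea to several time windows; the function name `downtimeByWindow` and the 30-day "month" are assumptions chosen for illustration, not part of the original script.

```javascript
// Seconds in each window. A "month" is taken as 30 days for illustration.
const WINDOW_SECONDS = {
  day: 24 * 60 * 60,
  month: 30 * 24 * 60 * 60,
  year: 365 * 24 * 60 * 60,
};

// Returns the allowed downtime, in seconds, for each window.
function downtimeByWindow(availabilityPercent) {
  const allowedFraction = 1 - availabilityPercent / 100;
  const budget = {};
  for (const [name, seconds] of Object.entries(WINDOW_SECONDS)) {
    budget[name] = seconds * allowedFraction;
  }
  return budget;
}

const fourNines = downtimeByWindow(99.99);
console.log(fourNines.day.toFixed(1));          // "8.6"  (seconds per day)
console.log((fourNines.month / 60).toFixed(1)); // "4.3"  (minutes per 30-day month)
console.log((fourNines.year / 60).toFixed(1));  // "52.6" (minutes per year)
```

Seen per day, four nines leaves under nine seconds, which is less time than a single slow restart usually takes.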

SLI vs SLO vs SLA

These three get used interchangeably and shouldn’t be:

| Term | What it is | Example |
|---|---|---|
| SLI (Indicator) | A measured number | “99.97% of requests succeeded this month” |
| SLO (Objective) | Your internal target | “≥ 99.99% successful requests” |
| SLA (Agreement) | A contract with consequences | “Below 99.9%, customers get a refund” |

The relationship: you measure SLIs, you aim for SLOs, and you promise SLAs. Teams almost always set the internal SLO stricter than the external SLA, so there’s a buffer before any contractual penalty. The gap between your SLO and 100% is your error budget — the amount of failure you’re allowed to spend on risky deploys and experiments.
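To make the error budget concrete, here is a minimal sketch for a request-based SLO. The function `errorBudget` and its field names are hypothetical, and the request counts are made-up inputs chosen to keep the arithmetic easy to follow.

```javascript
// Error budget for a request-based SLO over some measurement window.
// sloPercent: target success rate, e.g. 99.9
// totalRequests, failedRequests: counts observed in the window (the SLI inputs)
function errorBudget(sloPercent, totalRequests, failedRequests) {
  const allowedFailures = totalRequests * (1 - sloPercent / 100);
  return {
    allowedFailures,
    remaining: allowedFailures - failedRequests,
    // Fraction of the budget already spent; above 1 means the SLO is breached.
    // A 100% SLO leaves no budget at all, so it is reported as Infinity.
    consumed: allowedFailures > 0 ? failedRequests / allowedFailures : Infinity,
  };
}

const budget = errorBudget(99.9, 1000000, 400);
console.log(Math.round(budget.allowedFailures)); // 1000
console.log(Math.round(budget.remaining));       // 600
console.log(budget.consumed.toFixed(2));         // "0.40"
```

With a 99.9% SLO and one million requests, 1,000 failures are allowed. Having spent 400 of them, the team still has room for a risky deploy; at 1,000 or more, the usual response is to pause releases until the window rolls over.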

Redundancy: how components combine

A system’s availability depends on how its parts are wired together.

Serial (a chain — all must work). If a request must pass through a load balancer and an app server and a database, their availabilities multiply, so the total is lower than any single part:

0.999 × 0.999 × 0.999 ≈ 0.997  →  three components at three nines each give only ~99.7% end to end

More dependencies in the critical path = lower availability. This is why chatty architectures with many synchronous hops are fragile.
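The serial rule is a one-line product. This sketch assumes the components fail independently of each other; `serialAvailability` is a name chosen for illustration.

```javascript
// Availability of components in series: every one must be up,
// so the individual availabilities multiply.
function serialAvailability(components) {
  return components.reduce((total, a) => total * a, 1);
}

// Load balancer -> app server -> database, each at 99.9%.
const chain = serialAvailability([0.999, 0.999, 0.999]);
console.log((chain * 100).toFixed(2) + '%'); // "99.70%"

// Ten synchronous hops at 99.9% each.
const tenHops = serialAvailability(Array(10).fill(0.999));
console.log((tenHops * 100).toFixed(2) + '%'); // "99.00%"
```

Ten hops at three nines each leave the request path at roughly two nines, which is the cost of a long synchronous call chain.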

Redundant (parallel: any one suffices). Put two app servers behind the load balancer and the system survives if either works. If each server is 99% available and the two fail independently, you now multiply the failure probabilities:

P(both down) = 0.01 × 0.01 = 0.0001
availability = 1 - 0.0001 = 99.99%

Two “two-nines” servers in parallel give four nines together. Redundancy is how you buy availability — but only for the part you actually duplicated. A single shared database behind those two app servers is still a single point of failure, and it caps the whole system at the database’s availability.
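Both rules can be combined to see the single-point-of-failure cap in numbers. This is a sketch under the assumption of independent failures; the helper names and the 99.9% figure for the shared database are illustrative.

```javascript
// Redundant replicas: the group is down only if every replica is down.
function parallelAvailability(replicas) {
  const allDown = replicas.reduce((p, a) => p * (1 - a), 1);
  return 1 - allDown;
}

// Components in series: all must be up, so availabilities multiply.
function serialAvailability(components) {
  return components.reduce((total, a) => total * a, 1);
}

// Two 99% app servers in parallel.
const appTier = parallelAvailability([0.99, 0.99]);
console.log((appTier * 100).toFixed(2) + '%'); // "99.99%"

// The redundant app tier in front of one shared 99.9% database.
const system = serialAvailability([appTier, 0.999]);
console.log((system * 100).toFixed(2) + '%'); // "99.89%"
```

The app tier reaches four nines, but the whole system lands just under the database's 99.9%. Adding a third app server would change almost nothing; only replicating the database moves the total.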

The JavaScript angle: health checks and graceful degradation

Redundancy only works if the load balancer can tell a healthy instance from a sick one — which means your Node service needs a health endpoint, and it should fail fast when a critical dependency is gone:

A health check that tells the truth (script.js)
// Assumes an Express-style `app`; `db` and `cache` stand in for your own clients.
// The load balancer polls this; a 503 takes the instance out of rotation.
app.get('/healthz', async (req, res) => {
  try {
    await db.ping();               // is our critical dependency reachable?
    res.status(200).json({ ok: true });
  } catch {
    res.status(503).json({ ok: false }); // pull me from the pool, don't route here
  }
});

// Graceful degradation: serve stale cache instead of erroring when the DB is down.
async function getProfile(id) {
  try {
    return await db.getProfile(id);
  } catch {
    // `await` also covers an async cache client; it is harmless if getStale is synchronous.
    return (await cache.getStale(id)) ?? { id, degraded: true }; // partial > nothing
  }
}

A service that returns a healthy 200 while its database is unreachable defeats the entire point of redundancy — the load balancer keeps sending traffic into a black hole. Honest health checks and graceful fallbacks are how availability math turns into actual uptime.
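Failing fast also means not hanging: a probe that waits forever on a stuck dependency is as misleading as one that always returns 200. The sketch below bounds the check with a timeout. `withTimeout` and `healthStatus` are hypothetical helpers, the 1000 ms default is an arbitrary choice, and `ping` stands for any function that returns a promise, such as `() => db.ping()`.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Resolves to the HTTP status a health endpoint should send.
async function healthStatus(ping, timeoutMs = 1000) {
  try {
    await withTimeout(ping(), timeoutMs);
    return 200;
  } catch {
    return 503; // unreachable or too slow: take this instance out of rotation
  }
}

// Usage with stand-in dependencies:
healthStatus(() => Promise.resolve()).then((s) => console.log(s));         // 200
healthStatus(() => new Promise(() => {}), 50).then((s) => console.log(s)); // 503
```

Inside a route handler this would be used as `res.status(await healthStatus(() => db.ping())).end()`, again assuming an Express-style `res`. The timeout should be shorter than the load balancer's own probe timeout, so the instance reports 503 itself instead of being marked unresponsive.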

That completes the Foundations: a method, the scale math, the latency ratios, and the availability model. Next we start assembling real systems, beginning with the building blocks of scale.