Availability — SLAs, SLOs, and the Nines

Availability math — what the nines mean in real downtime, how redundancy multiplies uptime, and the difference between SLI, SLO, and SLA.

8 min read · Level 3/5 · #system-design #availability #reliability

What you'll learn

Translate "the nines" into real downtime budgets
Distinguish SLI, SLO, and SLA
Reason about how serial vs redundant components affect availability

“Highly available” is meaningless until you attach a number. That number is usually expressed in nines (99.9%, 99.99%, 99.999%), and each extra nine cuts the allowed downtime by a factor of ten, which typically makes it far harder and more expensive to achieve. Knowing what the nines cost in real downtime is the difference between a target you can defend and a number you copied off a slide.

The nines, as actual downtime

Availability	Name	Downtime/year	Downtime/day
99%	“two nines”	~3.65 days	~14.4 min
99.9%	“three nines”	~8.76 hours	~1.44 min
99.99%	“four nines”	~52.6 min	~8.6 sec
99.999%	“five nines”	~5.26 min	~0.86 sec

Two nines sounds impressive and is nearly worthless — almost four days down a year. Four nines (~53 minutes/year) is the common target for serious consumer services. Five nines is telecom/infrastructure territory and is very expensive to reach and maintain.

Computing the downtime budget script.js

const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeBudget(availabilityPercent) {
  const allowedFraction = 1 - availabilityPercent / 100;
  const minutes = MINUTES_PER_YEAR * allowedFraction;
  return `${minutes.toFixed(1)} min/year`;
}

console.log(downtimeBudget(99.9));   // "525.6 min/year"  (~8.76 h)
console.log(downtimeBudget(99.99));  // "52.6 min/year"
console.log(downtimeBudget(99.999)); // "5.3 min/year"

SLI vs SLO vs SLA

These three get used interchangeably and shouldn’t be:

Term	What it is	Example
SLI (Indicator)	A measured number	“99.97% of requests succeeded this month”
SLO (Objective)	Your internal target	“≥ 99.99% successful requests”
SLA (Agreement)	A contract with consequences	“Below 99.9%, customers get a refund”

The relationship: you measure SLIs, you aim for SLOs, and you promise SLAs. Teams almost always set the internal SLO stricter than the external SLA, so there’s a buffer before any contractual penalty. The gap between your SLO and 100% is your error budget — the amount of failure you’re allowed to spend on risky deploys and experiments.
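As a rough illustration of how an error budget can be tracked, the sketch below turns request counts into an SLI and compares it against a target. The helper name `errorBudgetReport` and the request counts are invented for this example. Real systems usually compute the same figures over a rolling window from monitoring data.

```javascript
// Hypothetical helper, not from any library. All counts are illustrative.
function errorBudgetReport(totalRequests, failedRequests, targetPercent) {
  const sliPercent = (1 - failedRequests / totalRequests) * 100;       // measured success rate
  const allowedFailures = totalRequests * (1 - targetPercent / 100);   // budget, in requests
  const budgetUsed = failedRequests / allowedFailures;                 // 1.0 means exhausted
  return {
    sliPercent,
    allowedFailures,
    budgetUsed,
    targetMet: failedRequests <= allowedFailures,
  };
}

// 300 failures out of 1,000,000 requests is an SLI of 99.97%.
const vsSlo = errorBudgetReport(1_000_000, 300, 99.99); // internal SLO
const vsSla = errorBudgetReport(1_000_000, 300, 99.9);  // external SLA

console.log(vsSlo.sliPercent.toFixed(2)); // "99.97"
console.log(vsSlo.targetMet);             // false: the SLO budget is overspent
console.log(vsSla.targetMet);             // true: still inside the contractual SLA
```

With these numbers the team has missed its own SLO but not yet triggered the SLA penalty, which is the buffer a stricter internal target is meant to provide.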

Redundancy: how components combine

A system’s availability depends on how its parts are wired together.

Serial (a chain — all must work). If a request must pass through a load balancer and an app server and a database, their availabilities multiply, so the total is lower than any single part:

0.999 × 0.999 × 0.999 ≈ 0.997  →  three three-nines components in series give ~99.7%

More dependencies in the critical path = lower availability. This is why chatty architectures with many synchronous hops are fragile.
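The serial rule is a plain product, which a few lines of code can show. This is a minimal sketch that assumes component failures are independent, which real systems only approximate. The function name `serialAvailability` is made up for the example.

```javascript
// Availability of a chain in which every component must be up.
// Assumes independent failures (an idealization).
function serialAvailability(parts) {
  return parts.reduce((total, a) => total * a, 1);
}

// Load balancer, app server, database: each at three nines.
const chain = [0.999, 0.999, 0.999];
console.log((serialAvailability(chain) * 100).toFixed(2) + "%"); // "99.70%"

// Ten synchronous hops at 99.9% each already drop the path to about two nines.
const tenHops = Array(10).fill(0.999);
console.log((serialAvailability(tenHops) * 100).toFixed(2) + "%"); // "99.00%"
```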

Redundant (parallel, any one suffices). Put two app servers behind the load balancer and the system survives if either works. Now you multiply the failure probabilities instead. Assuming each server is 99% available (down 1% of the time) and the two fail independently:

P(both down) = 0.01 × 0.01 = 0.0001
availability = 1 - 0.0001 = 99.99%

Two “two-nines” servers in parallel give four nines together. Redundancy is how you buy availability — but only for the part you actually duplicated. A single shared database behind those two app servers is still a single point of failure, and it caps the whole system at the database’s availability.
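To see how a single shared dependency limits the gain, the sketch below combines both rules for a load balancer, a redundant app tier, and one database. The availability figures and function names are assumptions chosen for illustration, and independent failures are assumed throughout. Correlated outages such as a shared rack or a bad deploy would make the real numbers worse.

```javascript
// Parallel group: it is up unless every member is down.
function redundantAvailability(parts) {
  const allDown = parts.reduce((p, a) => p * (1 - a), 1);
  return 1 - allDown;
}

// Serial chain: every stage must be up.
function chainAvailability(stages) {
  return stages.reduce((total, a) => total * a, 1);
}

const appTier = redundantAvailability([0.99, 0.99]);        // two 99% servers
const system = chainAvailability([0.999, appTier, 0.999]);  // LB, app tier, single DB

console.log((appTier * 100).toFixed(2) + "%"); // "99.99%"
console.log((system * 100).toFixed(2) + "%");  // "99.79%"
```

The app tier reaches four nines, yet the end-to-end figure stays below the 99.9% of the unduplicated load balancer and database.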

The JavaScript angle: health checks and graceful degradation

Redundancy only works if the load balancer can tell a healthy instance from a sick one — which means your Node service needs a health endpoint, and it should fail fast when a critical dependency is gone:

A health check that tells the truth script.js

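// Assumes Express; `db` and `cache` are this service's own clients, created elsewhere.
const express = require('express');
const app = express();

// The load balancer polls this; a 503 takes the instance out of rotation.
app.get('/healthz', async (req, res) => {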
  try {
    await db.ping();               // is our critical dependency reachable?
    res.status(200).json({ ok: true });
  } catch {
    res.status(503).json({ ok: false }); // pull me from the pool, don't route here
  }
});

// Graceful degradation: serve stale cache instead of erroring when the DB is down.
async function getProfile(id) {
  try {
    return await db.getProfile(id);
  } catch {
    return cache.getStale(id) ?? { id, degraded: true }; // partial > nothing
  }
}

A service that returns a healthy 200 while its database is unreachable defeats the entire point of redundancy — the load balancer keeps sending traffic into a black hole. Honest health checks and graceful fallbacks are how availability math turns into actual uptime.
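One gap worth covering is a dependency that hangs instead of failing: a ping that never answers keeps the health check waiting, so the instance is never marked unhealthy promptly. One possible way to fail fast is to race the ping against a timer. The names `withTimeout`, `checkHealth`, and `hungDb`, and the timeout values, are hypothetical and exist only for this sketch.

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Returns the HTTP status a health endpoint could send.
async function checkHealth(db, timeoutMs = 500) {
  try {
    await withTimeout(db.ping(), timeoutMs);
    return 200;
  } catch {
    return 503; // unreachable or too slow: report unhealthy
  }
}

// Stand-in for a database client whose ping never answers (demo assumption).
const hungDb = { ping: () => new Promise(() => {}) };

checkHealth(hungDb, 50).then((status) => console.log(status)); // 503
```

Inside a route handler, the returned status could be passed straight to `res.status(...)`.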

That completes the Foundations: a method, the scale math, the latency ratios, and the availability model. Next we start assembling real systems, beginning with the building blocks of scale.