How Much Downtime Is "Four Nines" Anyway?
Availability — SLAs, SLOs, and the Nines
Availability math — what the nines mean in real downtime, how redundancy multiplies uptime, and the difference between SLI, SLO, and SLA.
What you'll learn
- Translate "the nines" into real downtime budgets
- Distinguish SLI, SLO, and SLA
- Reason about how serial vs redundant components affect availability
“Highly available” is meaningless until you attach a number. That number is usually expressed in nines (99.9%, 99.99%, 99.999%). Each extra nine cuts the allowed downtime tenfold, and as a rule of thumb is far harder and more expensive to achieve than the last. Knowing what the nines cost in real downtime is the difference between a target you can defend and a number you copied off a slide.
The nines, as actual downtime
| Availability | Name | Downtime/year | Downtime/day |
|---|---|---|---|
| 99% | “two nines” | ~3.65 days | ~14.4 min |
| 99.9% | “three nines” | ~8.76 hours | ~1.44 min |
| 99.99% | “four nines” | ~52.6 min | ~8.6 sec |
| 99.999% | “five nines” | ~5.26 min | ~0.86 sec |
Two nines sounds impressive and is nearly worthless — almost four days down a year. Four nines (~53 minutes/year) is the common target for serious consumer services. Five nines is telecom/infrastructure territory and is very expensive to reach and maintain.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeBudget(availabilityPercent) {
  const allowedFraction = 1 - availabilityPercent / 100;
  const minutes = MINUTES_PER_YEAR * allowedFraction;
  return `${minutes.toFixed(1)} min/year`;
}

console.log(downtimeBudget(99.9)); // "525.6 min/year" (~8.76 h)
console.log(downtimeBudget(99.99)); // "52.6 min/year"
console.log(downtimeBudget(99.999)); // "5.3 min/year"

SLI vs SLO vs SLA
These three get used interchangeably and shouldn’t be:
| Term | What it is | Example |
|---|---|---|
| SLI (Indicator) | A measured number | “99.97% of requests succeeded this month” |
| SLO (Objective) | Your internal target | “≥ 99.99% successful requests” |
| SLA (Agreement) | A contract with consequences | “Below 99.9%, customers get a refund” |
The relationship: you measure SLIs, you aim for SLOs, and you promise SLAs. Teams almost always set the internal SLO stricter than the external SLA, so there’s a buffer before any contractual penalty. The gap between your SLO and 100% is your error budget — the amount of failure you’re allowed to spend on risky deploys and experiments.
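To make the error budget concrete, here is a minimal sketch that measures an SLI from request counts and compares it against a target. The function name `errorBudget`, its parameters, and the sample request counts are assumptions for illustration, not part of any real SLO tooling.

```javascript
// Hypothetical helper: the name, parameters, and sample numbers are illustrative only.
function errorBudget(targetPercent, totalRequests, failedRequests) {
  // SLI: the measured success rate, as a percentage.
  const sli = ((totalRequests - failedRequests) / totalRequests) * 100;
  // Error budget: how many failed requests the target tolerates in this window.
  const allowedFailures = totalRequests * (1 - targetPercent / 100);
  // Negative remaining budget means the target was missed.
  const remaining = allowedFailures - failedRequests;
  return { sli, allowedFailures, remaining, targetMet: remaining >= 0 };
}

// One million requests this month, 300 of them failed (SLI = 99.97%).
const vsSla = errorBudget(99.9, 1000000, 300);  // external SLA from the table
const vsSlo = errorBudget(99.99, 1000000, 300); // stricter internal SLO

console.log(vsSla.sli.toFixed(2) + "%");                         // "99.97%"
console.log(Math.round(vsSla.remaining), vsSla.targetMet);       // 700 true
console.log(Math.round(vsSlo.remaining), vsSlo.targetMet);       // -200 false
```

With these numbers the same month misses the internal SLO while staying inside the SLA, which is the buffer doing its job: the team gets a warning before any contractual penalty applies.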
Redundancy: how components combine
A system’s availability depends on how its parts are wired together.
Serial (a chain — all must work). If a request must pass through a load balancer and an app server and a database, their availabilities multiply, so the total is lower than any single part:
0.999 × 0.999 × 0.999 ≈ 0.997 → three parts at three nines each give only ~99.7% together
More dependencies in the critical path = lower availability. This is why chatty architectures with many synchronous hops are fragile.
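The serial rule can be sketched in a few lines. The helper name `serialAvailability` is an assumption for illustration, and the calculation assumes each hop fails independently of the others.

```javascript
// Hypothetical helper: availability of a chain where every part must work.
// Inputs are fractions (0.999 = 99.9%); assumes independent failures.
function serialAvailability(parts) {
  return parts.reduce((total, availability) => total * availability, 1);
}

const threeHops = serialAvailability([0.999, 0.999, 0.999]);
const tenHops = serialAvailability(new Array(10).fill(0.999));

console.log((threeHops * 100).toFixed(2) + "%"); // "99.70%"
console.log((tenHops * 100).toFixed(2) + "%");   // "99.00%"
```

Ten synchronous hops at three nines each leave the request path at roughly two nines, which is the arithmetic behind the fragility of chatty architectures.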
Redundant (parallel, any one suffices). Put two app servers behind the load balancer and the system survives if either works. Now you multiply the failure probabilities instead. Assuming each server is 99% available (down 1% of the time) and the two fail independently:
P(both down) = 0.01 × 0.01 = 0.0001
availability = 1 - 0.0001 = 99.99%
Two “two-nines” servers in parallel give four nines together. Redundancy is how you buy availability — but only for the part you actually duplicated. A single shared database behind those two app servers is still a single point of failure, and it caps the whole system at the database’s availability.
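A short sketch of the parallel rule, including the shared-database cap. The helper name `parallelAvailability` and the 99.9% database figure are assumptions for illustration, and the math assumes replicas fail independently, which real deployments only approximate.

```javascript
// Hypothetical helper: availability of a redundant group where any one replica suffices.
// Inputs are fractions (0.99 = 99%); assumes independent failures.
function parallelAvailability(replicas) {
  const allDown = replicas.reduce((p, availability) => p * (1 - availability), 1);
  return 1 - allDown;
}

// Two 99% app servers in parallel.
const appTier = parallelAvailability([0.99, 0.99]);
console.log((appTier * 100).toFixed(2) + "%"); // "99.99%"

// The same app tier in series with one shared database (assumed 99.9% available).
const sharedDb = 0.999;
const system = appTier * sharedDb;
console.log((system * 100).toFixed(2) + "%"); // "99.89%"
```

The redundant app tier reaches four nines on its own, but the whole system lands just under the database's 99.9% because the database was not duplicated.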
The JavaScript angle: health checks and graceful degradation
Redundancy only works if the load balancer can tell a healthy instance from a sick one — which means your Node service needs a health endpoint, and it should fail fast when a critical dependency is gone:
// The load balancer polls this; a 503 takes the instance out of rotation.
app.get('/healthz', async (req, res) => {
  try {
    await db.ping(); // is our critical dependency reachable?
    res.status(200).json({ ok: true });
  } catch {
    res.status(503).json({ ok: false }); // pull me from the pool, don't route here
  }
});

// Graceful degradation: serve stale cache instead of erroring when the DB is down.
async function getProfile(id) {
  try {
    return await db.getProfile(id);
  } catch {
    return cache.getStale(id) ?? { id, degraded: true }; // partial > nothing
  }
}

A service that returns a healthy 200 while its database is unreachable defeats the entire point of redundancy: the load balancer keeps sending traffic into a black hole. Honest health checks and graceful fallbacks are how availability math turns into actual uptime.
That completes the Foundations: a method, the scale math, the latency ratios, and the availability model. Next we start assembling real systems, beginning with the building blocks of scale.