Observability

You Can't Fix What You Can't See

Logs, metrics, and traces — the three pillars — plus the RED and USE methods for what to measure, and structured logging with correlation IDs in Node.

9 min read · Level 3/5 · #system-design #observability #logging

What you'll learn
  • Distinguish logs, metrics, and traces and when each helps
  • Apply the RED and USE methods to decide what to measure
  • Add structured logging and request IDs to a Node service

When a system is small you debug it by reading the code. When it’s a fleet of services handling millions of requests, the code can’t tell you what actually happened to request #8,472,103 at 3am. Observability is the practice of instrumenting a system so you can answer questions about its behavior from the outside — ideally questions you didn’t know to ask in advance.

The distinction worth internalizing: monitoring tells you whether something is wrong (a dashboard, an alert); observability lets you ask why. You get there with three kinds of telemetry.

The three pillars

| Pillar | What it is | Answers | Cost |
| --- | --- | --- | --- |
| Logs | Timestamped event records | “What happened in this exact request?” | High volume |
| Metrics | Numeric aggregates over time | “Is error rate rising? p99 latency?” | Cheap, low cardinality |
| Traces | One request’s path across services | “Where did the 2s go?” | Sampled |

  • Logs are discrete events — “user 42 logged in”, “DB query failed”. Rich and specific, but voluminous and expensive to search at scale. Best for the detail of a single occurrence.
  • Metrics are numbers aggregated over time — request count, error rate, p99 latency, queue depth. Cheap to store and fast to query, which is what makes them the backbone of dashboards and alerts. They tell you that something changed, not the per-request detail of what.
  • Traces follow a single request as it hops across services, recording how long each hop took. When a request that touches six services is slow, a trace tells you which of the six ate the time — impossible to see from one service’s logs alone.

You need all three: metrics to notice a problem, traces to localize it, logs to understand it.
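
To make the "which of the six ate the time" idea concrete, here is a minimal sketch that computes per-span self time from a handful of spans. The span shape (`id`, `parentId`, `service`, `durationMs`) and the service names are assumptions for illustration, not a real tracing schema, and the calculation assumes child spans do not overlap.

```javascript
// Hypothetical spans for one request; field names are illustrative only.
const spans = [
  { id: 'a', parentId: null, service: 'gateway', durationMs: 2000 },
  { id: 'b', parentId: 'a', service: 'orders', durationMs: 1900 },
  { id: 'c', parentId: 'b', service: 'payments', durationMs: 1700 },
  { id: 'd', parentId: 'b', service: 'inventory', durationMs: 100 },
];

// Self time = a span's duration minus the time spent in its direct children.
// Assumes children run one after another (no overlap).
function selfTimes(spans) {
  const childTotal = new Map();
  for (const s of spans) {
    if (s.parentId !== null) {
      childTotal.set(s.parentId, (childTotal.get(s.parentId) ?? 0) + s.durationMs);
    }
  }
  return spans.map((s) => ({
    service: s.service,
    selfMs: s.durationMs - (childTotal.get(s.id) ?? 0),
  }));
}

// The span with the largest self time is where the request actually waited.
const slowest = selfTimes(spans).sort((x, y) => y.selfMs - x.selfMs)[0];
console.log(slowest); // { service: 'payments', selfMs: 1700 }
```

The gateway span lasts the full 2 seconds, but almost all of that is time spent waiting on its children. Only the breakdown across spans shows that the payments hop is the one to investigate.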

What to measure: RED and USE

Two mnemonics save you from the “we have 4,000 metrics and none of them help” trap by telling you the small set that matters.

RED — for request-driven services (your APIs):

  • Rate — requests per second.
  • Errors — how many of them failed.
  • Duration — the latency distribution (watch p99, not the average — averages hide the tail that’s actually hurting users).
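
As a rough illustration of the three RED numbers, the sketch below summarizes one window of request records. The record shape (`status`, `durationMs`), the choice to count only 5xx responses as errors, and the nearest-rank percentile are all assumptions made for this example; a real service would normally let its metrics library compute these.

```javascript
// Summarize one non-empty window of request records into RED numbers.
// Record shape ({ status, durationMs }) is an assumption for this sketch.
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const failed = requests.filter((r) => r.status >= 500).length;
  // Nearest-rank percentile: the smallest sample with at least p% of
  // samples at or below it.
  const percentile = (p) =>
    durations[Math.max(0, Math.ceil((p * durations.length) / 100) - 1)];
  const meanMs = durations.reduce((sum, d) => sum + d, 0) / durations.length;
  return {
    ratePerSec: requests.length / windowSeconds, // Rate
    errorRatio: failed / requests.length,        // Errors
    meanMs,                                      // Duration (average)
    p99Ms: percentile(99),                       // Duration (tail)
  };
}

// 98 fast requests and 2 slow, failing ones in a 10-second window.
const sample = [
  ...Array.from({ length: 98 }, () => ({ status: 200, durationMs: 20 })),
  { status: 500, durationMs: 3000 },
  { status: 500, durationMs: 3000 },
];

const red = redSummary(sample, 10);
console.log(red);
// { ratePerSec: 10, errorRatio: 0.02, meanMs: 79.6, p99Ms: 3000 }
```

The mean of about 80 ms looks healthy, while the p99 of 3000 ms exposes the requests that are actually hurting users.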

USE — for resources (CPU, memory, disk, connection pools):

  • Utilization — how busy is it (% time in use)?
  • Saturation — how much extra work is queued and waiting?
  • Errors — error count for the resource.
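
The same three questions can be asked of a connection pool. The snapshot fields below (`size`, `inUse`, `waiting`, `errors`) are hypothetical names chosen for this sketch, not any particular driver's API.

```javascript
// USE for a hypothetical connection pool snapshot.
function useSummary(pool) {
  return {
    utilization: pool.inUse / pool.size, // fraction of connections busy
    saturation: pool.waiting,            // callers queued for a connection
    errors: pool.errors,                 // failed acquisitions or connections
  };
}

const snapshot = { size: 10, inUse: 10, waiting: 7, errors: 0 };
const use = useSummary(snapshot);
console.log(use); // { utilization: 1, saturation: 7, errors: 0 }
```

Utilization alone says the pool is fully busy, which may be fine. The saturation figure is what shows that seven callers are stuck waiting behind it.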

Structured logging: logs a machine can read

A log line like “User 42 failed checkout” is fine for a human and close to useless for a machine: you can’t reliably filter, aggregate, or alert on free text. Structured logging emits each event as JSON with consistent fields, so your log system can query it like data:

Free text vs structured logs (script.js)
import pino from 'pino';
const log = pino();

// ❌ Unstructured: cannot be queried or aggregated reliably.
console.log(`User ${userId} failed checkout: ${err.message}`);

// ✅ Structured: every field is filterable and aggregatable.
log.error({
  event: 'checkout_failed',
  userId,
  orderId,
  reason: err.code,
  durationMs: Date.now() - start,
});
// → {"level":50,"event":"checkout_failed","userId":42,"orderId":"o_91",...}

Now “how many checkouts failed with reason INSUFFICIENT_FUNDS in the last hour?” is a query, not a grep through prose.
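
A log platform would run that query for you, but the idea can be sketched in a few lines over newline-delimited JSON. The field names follow the checkout example above; the sample lines and the `CARD_EXPIRED` reason are made up for illustration, and `time` is assumed to be epoch milliseconds, which is pino's default timestamp format.

```javascript
// Count failed checkouts per reason since a given time, reading
// newline-delimited JSON log lines.
function countFailuresByReason(ndjson, sinceMs) {
  const counts = {};
  for (const line of ndjson.split('\n')) {
    if (!line.trim()) continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (entry.event !== 'checkout_failed' || entry.time < sinceMs) continue;
    counts[entry.reason] = (counts[entry.reason] ?? 0) + 1;
  }
  return counts;
}

// Illustrative sample data, not real log output.
const logs = [
  '{"level":50,"time":1700000000000,"event":"checkout_failed","reason":"INSUFFICIENT_FUNDS"}',
  '{"level":50,"time":1700000005000,"event":"checkout_failed","reason":"CARD_EXPIRED"}',
  '{"level":30,"time":1700000006000,"event":"checkout_ok"}',
  '{"level":50,"time":1700000009000,"event":"checkout_failed","reason":"INSUFFICIENT_FUNDS"}',
].join('\n');

const byReason = countFailuresByReason(logs, 1700000000000);
console.log(byReason); // { INSUFFICIENT_FUNDS: 2, CARD_EXPIRED: 1 }
```

None of this works on the free-text version without a fragile regular expression per message format.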

The JavaScript angle: correlation IDs

A single user action might touch your gateway, an order service, and a payment service. To reconstruct the whole story, every log line for that action needs a shared correlation ID (a.k.a. request ID). Generate it at the edge, thread it through every downstream call, and attach it to every log:

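A request-ID middleware with a child logger (script.js)
import { randomUUID } from 'node:crypto';
import express from 'express'; // assumes Express, matching app.use / app.get below
import pino from 'pino';

const app = express();
const baseLog = pino();
// Base URL of the order service. The ORDERS_URL variable name and the
// localhost fallback are assumptions for this example.
const ORDERS = process.env.ORDERS_URL ?? 'http://localhost:4000';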

// Generate (or accept) a correlation ID, then bind a logger that
// stamps it onto every line this request produces.
function requestId(req, res, next) {
  req.id = req.headers['x-request-id'] ?? randomUUID();
  res.setHeader('x-request-id', req.id);     // echo it back to the client
  req.log = baseLog.child({ requestId: req.id });
  next();
}

app.use(requestId);

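app.get('/orders/:id', async (req, res, next) => {
  req.log.info({ event: 'fetch_order', orderId: req.params.id });
  try {
    // Propagate the ID downstream so the order service logs it too:
    const upstream = await fetch(`${ORDERS}/orders/${encodeURIComponent(req.params.id)}`, {
      headers: { 'x-request-id': req.id },
    });
    if (!upstream.ok) throw new Error(`order service responded ${upstream.status}`);
    res.json(await upstream.json());
  } catch (err) {
    // Log the failure with the same request ID, then let Express handle it.
    req.log.error({ event: 'fetch_order_failed', orderId: req.params.id, err });
    next(err);
  }
});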

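Now a single query for requestId = "..." returns every log line across every service for that one user action, provided each service binds the incoming x-request-id to its own logger and ships its logs to a shared store. That same propagated ID is the seed of a distributed trace. Tools like OpenTelemetry automate the threading for you: by default they propagate a trace ID and parent span ID in the W3C traceparent header and record timing for each hop, but the underlying mechanic is the same.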
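
As a small illustration of that reassembly step, the sketch below filters a mixed stream of log entries down to one request and orders them by time. The service names, events, and the simplified integer `time` values are invented for the example; in practice a log store's query language does this work.

```javascript
// Log entries as they might arrive at a central store from three services.
// All values here are illustrative.
const lines = [
  { service: 'payments', time: 3, requestId: 'r-1', event: 'charge_declined' },
  { service: 'gateway', time: 1, requestId: 'r-1', event: 'fetch_order' },
  { service: 'gateway', time: 2, requestId: 'r-2', event: 'fetch_order' },
  { service: 'orders', time: 2, requestId: 'r-1', event: 'order_loaded' },
];

// Keep one request's entries and sort by time to recover the sequence.
function storyOf(requestId, entries) {
  return entries
    .filter((l) => l.requestId === requestId)
    .sort((a, b) => a.time - b.time)
    .map((l) => `${l.service}:${l.event}`);
}

const story = storyOf('r-1', lines);
console.log(story);
// [ 'gateway:fetch_order', 'orders:order_loaded', 'payments:charge_declined' ]
```

Sorting by timestamp across machines is only as good as their clock synchronization, which is one reason tracing systems record explicit parent and child relationships instead of relying on time alone.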

Observability tells you when and where a system is breaking. The next lesson is about designing so that when a part breaks, the system as a whole keeps running — fault tolerance.