Distributed Locks

Mutual Exclusion Across Machines — Carefully

Distributed Locks

Redis locks with SET NX PX, leases and TTLs, the Redlock algorithm and its critiques, and the fencing token that makes locks actually safe.

9 min read Level 4/5 #system-design#distributed-locks#redis
What you'll learn
  • Build a basic Redis lock with SET NX PX and a TTL
  • Explain why Redlock is debated and where it breaks
  • Use fencing tokens to make locks correct under pauses

A distributed lock extends the familiar mutex across machines: only one process in your whole fleet may hold it at a time. You reach for one whenever a job must not run twice concurrently — sending an invoice, processing a payout, running a scheduled task that’s deployed to ten instances but should fire once.

Here’s the uncomfortable truth up front: distributed locks are much harder to get right than they look, and a naive implementation will occasionally let two holders in — which is exactly the case that corrupts data. This lesson builds one up, then shows why “occasionally” is the whole problem.

A basic Redis lock

The standard building block is a single atomic Redis command. SET key value NX PX ttl sets the key only if it doesn’t exist (NX) and gives it an expiry (PX milliseconds). Acquire = the SET succeeds; release = delete the key.

Acquire and release with ioredis script.js
import Redis from 'ioredis';
import { randomUUID } from 'node:crypto';
const redis = new Redis();

async function acquire(resource, ttlMs = 10_000) {
  const token = randomUUID();                  // unique owner token
  // NX = only if absent; PX = auto-expire so a dead holder can't lock forever.
  const ok = await redis.set(`lock:${resource}`, token, 'PX', ttlMs, 'NX');
  return ok === 'OK' ? token : null;           // token proves ownership
}

async function release(resource, token) {
  // Only delete if WE still own it — compare-and-delete, atomically, in Lua.
  const lua = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    else return 0 end`;
  return redis.eval(lua, 1, `lock:${resource}`, token);
}
▶ Preview: console

Two non-obvious details that are load-bearing:

  • The TTL is mandatory. If the holder crashes before releasing, the TTL expires the lock so it isn’t held forever. A lock with no expiry is a deadlock waiting to happen.
  • Release must check ownership. Without the Lua compare-and-delete, you could delete someone else’s lock: your lock expires, another process acquires it, and then your stale del frees their lock. The unique token plus atomic check-and-delete prevents that.

This is a lease, not a true lock: you hold it for a bounded time, and it can be taken from you when that time elapses. That bound is the source of every problem that follows.

The dangerous gap: a TTL can expire mid-work

Picture this sequence, which is rarer than you’d like but not rare enough:

Distributed Locks — architecture diagram

Worker A acquires the lock, then suffers a stop-the-world GC pause (or a slow syscall, or gets de-scheduled) for longer than the TTL. While A is frozen, the lock expires, B legitimately acquires it, and then A wakes up still believing it holds the lock and writes. Now two workers act as the sole holder simultaneously. No amount of token-checking on release helps, because A’s problem is on the write, not the release.

Redlock and its critique

To make Redis locks more robust against a Redis node failing, Redis’s author proposed Redlock: acquire the lock on a majority of N independent Redis nodes (say 3 of 5), so losing one Redis instance doesn’t lose the lock. It’s the quorum idea again.

Redlock is genuinely useful for efficiency locks — “probably don’t do this work twice, it’s just wasteful if we do.” But it drew a famous critique (from Martin Kleppmann) for correctness locks: no timing-based lock can be safe against the GC-pause problem above, because the pause can outlast the lease regardless of how many Redis nodes agreed. Adding more Redis nodes makes the lock more available, not more correct.

Fencing tokens: the fix that actually works

The real fix doesn’t try to make the lock perfect — it makes stale holders harmless. Each lock grant hands out a monotonically increasing fencing token. Every write to the protected resource carries its token, and the resource rejects any token lower than the highest it has seen.

Distributed Locks — architecture diagram

Now A’s late write is rejected because its token (33) is older than the one B already used (34). The storage layer enforces “newest holder wins” no matter how confused the clients are. The lock can be wrong; the outcome is still correct.

The honest summary: prefer to avoid distributed locks. Idempotency keys, single-writer-via-leader-election, and database constraints solve many “don’t do it twice” problems more safely. When you do need a lock, know whether it’s for efficiency or correctness — and fence the correctness ones.

Those fencing tokens need to be unique and ordered — which is the same challenge as generating unique IDs at scale, our next topic.