Search Systems

Inverted indexes, tokenization and analysis, the TF-IDF/BM25 intuition behind relevance, Elasticsearch basics, and when a database LIKE stops being enough.

9 min read Level 3/5 #system-design#search#inverted-index

What you'll learn

Explain why an inverted index beats scanning rows
Describe tokenization, analysis, and relevance scoring
Decide when LIKE no longer suffices and a search engine wins

“Let users search” sounds like a one-liner — WHERE description LIKE '%coffee%' — and for a small table it is. But that query scans every row and can’t use an index, so it grinds to a halt as data grows; it can’t rank results by relevance; and it has no idea that “running” should match “ran” or that “the” isn’t worth matching on. Real search is a different machine entirely, built around one data structure: the inverted index.

The inverted index

A normal (forward) index maps document → its words. An inverted index flips that: it maps word → the documents containing it. That inversion is the whole trick, because search asks the inverted question — “which documents contain ‘coffee’?” — and the inverted index answers it with a single lookup instead of a full scan.

Each word (a term) points to a postings list — the set of documents that contain it. A search for “quick coffee” looks up both terms and combines their postings lists. This is the same idea as a book’s index at the back: you don’t read every page to find “mitochondria,” you look it up and jump straight there.

Building a tiny inverted index script.js

function buildIndex(docs) {
  const index = new Map(); // term -> Set of doc ids
  for (const { id, text } of docs) {
    for (const term of analyze(text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(id);
    }
  }
  return index;
}

// Search = look up each term's postings, intersect for AND semantics.
function search(index, query) {
  const sets = analyze(query).map((t) => index.get(t) ?? new Set());
  if (sets.length === 0) return [];
  return [...sets.reduce((a, b) => new Set([...a].filter((x) => b.has(x))))];
}

▶ Preview: console

Analysis: turning text into terms

What counts as a “term” is not obvious, and getting it right is most of search quality. The analysis pipeline transforms raw text into the normalized terms you actually index and query:

Tokenization — split text into tokens (usually on whitespace/punctuation).
Lowercasing — so “Coffee” and “coffee” match.
Stop-word removal — drop ultra-common words (“the”, “a”, “is”) that add noise and bloat without aiding relevance.
Stemming / lemmatization — reduce words to a root so “running”, “runs”, and “ran” all match “run”.
Synonyms — optionally expand “couch” to also match “sofa”.

Relevance: TF-IDF and BM25

A keyword match tells you which documents contain the words. Relevance scoring tells you which ones to show first. The classic intuition is TF-IDF, built from two opposing forces:

Term Frequency (TF) — a term appearing more often in a document makes that document more relevant to that term.
Inverse Document Frequency (IDF) — a term appearing in fewer documents overall is more discriminating, so matching it counts for more. “Coffee” is a stronger signal than “the” precisely because it’s rarer.

Multiply them: a document scores high when it uses a query term often and that term is rare across the corpus. BM25 is the modern refinement everyone actually uses — same intuition, but it adds saturation (the 50th occurrence of a word barely beats the 49th) and length normalization (so a long document doesn’t win just by being long). You rarely compute these yourself; you just need to know why one result ranks above another.

When LIKE stops being enough

Need	`LIKE` / SQL	Search engine
Exact substring on a small table	✅ Fine	Overkill
Full-text over millions of rows	❌ Full scan, slow	✅ Inverted index
Relevance ranking	❌ None	✅ BM25
Stemming, synonyms, typo tolerance	❌ Manual, painful	✅ Built in
Faceted filtering + aggregations	⚠️ Awkward	✅ First-class

The crossover comes when you need ranking, linguistic matching (stemming, typos, synonyms), or speed over large text — that’s when you reach for a dedicated engine like Elasticsearch (or OpenSearch), which is essentially a distributed, production-grade inverted index. It shards the index across nodes, replicates each shard for fault tolerance (the same redundancy ideas from earlier in this section), and exposes the analysis and BM25 scoring as configuration.

A practical middle ground worth knowing: Postgres full-text search (tsvector / tsquery) gives you a real inverted index, stemming, and ranking inside your database — often enough until you need Elasticsearch’s scale, distribution, and richer relevance tuning.

Search is the last pure data-structures lesson of this section. The final piece of reliable, real-world systems is keeping them secure — authentication, authorization, and the cryptographic basics — which is where we head next.