Unicode and Surrogate Pairs

Iterate Code Points Not Code Units to Correctly Handle Emoji and Non-BMP Text

JavaScript encodes strings as UTF-16, where characters above U+FFFF use two code units called a surrogate pair — understanding this prevents length bugs, broken slices, and garbled output with emoji and international text.

What you'll learn
  • Explain surrogate pairs and why "💩".length === 2
  • Iterate a string by code point using Array.from and the string iterator
  • Apply Unicode normalization (NFC/NFD) before comparing or storing user text

Unicode assigns a code point (a number) to every character. The Basic Multilingual Plane (BMP) covers U+0000 to U+FFFF — these fit in one UTF-16 code unit. Characters above U+FFFF (supplementary planes) are encoded as a pair of code units called a surrogate pair: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF).
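The arithmetic behind a surrogate pair can be sketched in a few lines. This is a minimal illustration of the UTF-16 encoding rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the helper name toSurrogatePair is hypothetical and not a built-in.

```javascript
// Hypothetical helper: split a supplementary code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits select the high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits select the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1f4a9)); // true
```

In practice String.fromCodePoint does this for you; the sketch only shows where the two code units come from.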

Why "💩".length === 2

The pile-of-poo emoji is U+1F4A9 — above U+FFFF. JavaScript stores it as two UTF-16 code units: \uD83D (high surrogate) and \uDCA9 (low surrogate). The length property counts code units, so it returns 2.

const poo = "💩";
console.log(poo.length);           // 2 — code units
console.log(poo.charCodeAt(0).toString(16)); // 'd83d' — high surrogate
console.log(poo.charCodeAt(1).toString(16)); // 'dca9' — low surrogate
console.log(poo.codePointAt(0).toString(16)); // '1f4a9' — actual code point

// Slicing at a surrogate boundary produces a broken character
console.log(poo.slice(0, 1));      // '\uD83D' — half a surrogate pair
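One way to avoid cutting a pair in half is to check the code units on either side of the cut before slicing. The sketch below assumes you only need a prefix; slicePrefixSafely is a hypothetical helper name, and it backs off by one code unit when the cut would separate a high surrogate from its low surrogate.

```javascript
// Hypothetical helper: take the first `end` code units, but back off by one
// if the cut would land between a high and a low surrogate.
function slicePrefixSafely(str, end) {
  if (end > 0 && end < str.length) {
    const prev = str.charCodeAt(end - 1);
    const next = str.charCodeAt(end);
    const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff;
    const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
    if (prevIsHigh && nextIsLow) end -= 1; // keep the pair together by dropping it entirely
  }
  return str.slice(0, end);
}

console.log(slicePrefixSafely("A💩Z", 2)); // "A"
console.log(slicePrefixSafely("A💩Z", 3)); // "A💩"
```

This keeps the result well-formed UTF-16 at the cost of occasionally returning one code unit fewer than requested.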

Iterating Code Points Correctly

The string iterator (used by for...of, spread, and Array.from) is surrogate-pair aware. It yields each code point as a string, whether that code point occupies one code unit or two.

const text = "A💩Z";

// Wrong: iterates code units — breaks surrogate pair
for (let i = 0; i < text.length; i++) {
  process.stdout.write(text[i] + " "); // A \uD83D \uDCA9 Z
}

// Correct: iterates code points
for (const ch of text) {
  process.stdout.write(ch + " "); // A 💩 Z
}

// Array.from uses the string iterator — gives correct length
console.log(Array.from(text).length); // 3
console.log([...text].length);         // 3

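// Reverse a string by code point (keeps surrogate pairs intact;
// combining marks and flags still need grapheme-aware handling)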
const reversed = [...text].reverse().join("");
console.log(reversed); // "Z💩A"
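When you also need to know where each code point starts in the underlying code units (for example, to map back to slice offsets), an index-based loop with codePointAt works too. This is a minimal sketch; codePointOffsets is a hypothetical helper name.

```javascript
// Hypothetical helper: list each code point with the code unit index where it starts.
function codePointOffsets(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push({ index: i, codePoint: cp });
    i += cp > 0xffff ? 2 : 1; // supplementary code points occupy two code units
  }
  return result;
}

// Entries for "A💩Z": index 0 (U+0041), index 1 (U+1F4A9), index 3 (U+005A)
console.log(codePointOffsets("A💩Z"));
```

Note that the emoji starts at index 1 and the following letter at index 3, because the pair takes up indices 1 and 2.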

Counting “Characters” (Grapheme Clusters)

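A user-perceived character (grapheme cluster) can consist of multiple code points. For example, a base character plus a combining accent is two code points but one visible character, and a flag emoji is a pair of regional indicator code points. The Intl.Segmenter API (part of the ECMAScript Internationalization API, ECMA-402) splits text into grapheme clusters by default: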

const flag = "🇺🇸"; // Regional indicator U + S — two code points, one flag
console.log([...flag].length);  // 2 — code points

const segmenter = new Intl.Segmenter();
const segments = [...segmenter.segment(flag)];
console.log(segments.length);   // 1 — one grapheme cluster
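The same approach covers the combining-accent case and gives a reversal that keeps accents attached to their base letters. The helpers graphemes and reverseGraphemes below are hypothetical names for this sketch, and it assumes a runtime that ships Intl.Segmenter (recent Node.js and current browsers do).

```javascript
// Hypothetical helpers built on Intl.Segmenter with grapheme granularity.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Split a string into an array of grapheme cluster strings.
function graphemes(str) {
  return Array.from(graphemeSegmenter.segment(str), (s) => s.segment);
}

// Reverse by grapheme cluster so combining marks stay on their base character.
function reverseGraphemes(str) {
  return graphemes(str).reverse().join("");
}

const accented = "e\u0301"; // e + combining acute accent
console.log([...accented].length);       // 2 code points
console.log(graphemes(accented).length); // 1 grapheme cluster
console.log(reverseGraphemes("ae\u0301") === "e\u0301a"); // true, the accent stays on the e
```

A code-point reversal of the same input would move the accent to the front of the string, detached from its letter.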

Normalization

The same visible text can have multiple Unicode representations. NFC uses precomposed characters (single code point for “é”). NFD decomposes into base character + combining accent (two code points). Always normalize before comparing, storing, or indexing text.

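// Escapes make the two forms explicit; pasted literals can be silently
// normalized by editors or copy-paste and then look identical
const nfc = "\u00E9";    // é as a single precomposed code point
const nfd = "e\u0301";   // e + combining acute accent (U+0301)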

console.log(nfc.length);      // 1
console.log(nfd.length);      // 2
console.log(nfc === nfd);     // false

// Normalize both to NFC before comparing
console.log(nfc.normalize("NFC") === nfd.normalize("NFC")); // true
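The same idea applies when text is used as a key in a Set or Map. The sketch below deduplicates names by their NFC form; uniqueByNfc is a hypothetical helper, and NFC is an assumed choice here (any single form works as long as it is applied consistently).

```javascript
// Hypothetical helper: remove entries that differ only in Unicode representation.
function uniqueByNfc(names) {
  const seen = new Set();
  const unique = [];
  for (const name of names) {
    const key = name.normalize("NFC"); // canonical form used as the lookup key
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(key);
    }
  }
  return unique;
}

// "café" appears once precomposed and once decomposed; only one copy survives
console.log(uniqueByNfc(["caf\u00E9", "cafe\u0301", "tea"]).length); // 2
```

Without the normalize call, the Set would treat the two spellings of "café" as different keys.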

Complexity Reference

Operation                     Complexity   Notes
Array.from(str)               O(n)         n = number of code points
[...str].reverse().join("")   O(n)         Correct reversal via iterator
str.normalize("NFC")          O(n)         n = code unit length
Intl.Segmenter iteration      O(n)         n = code unit length
str[i] / str.charCodeAt(i)    O(1)         Code unit access
str.codePointAt(i)            O(1)         Reads 1 or 2 code units
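The O(1) entry for codePointAt holds because its argument is a code unit index, not a code point index. Finding the k-th code point means walking the string from the start, which is O(n). A minimal sketch, with nthCodePoint as a hypothetical helper name:

```javascript
// Hypothetical helper: return the k-th code point (0-based) as a string, or undefined.
function nthCodePoint(str, k) {
  let i = 0;
  for (const ch of str) { // walks code points from the start: O(n)
    if (i === k) return ch;
    i++;
  }
  return undefined;
}

const sample = "A💩Z";
console.log(sample.codePointAt(2).toString(16)); // 'dca9', index 2 is the low surrogate
console.log(nthCodePoint(sample, 2));            // 'Z'
```

If you need repeated random access by code point, converting once with Array.from(str) and indexing the array is usually the simpler trade-off.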

Up Next

With arrays and strings thoroughly covered, the next section introduces hashing — the data structure that powers O(1) lookups, deduplication, and frequency counting.
