Iterate Code Points, Not Code Units, to Correctly Handle Emoji and Non-BMP Text
Unicode and Surrogate Pairs
JavaScript encodes strings as UTF-16, where each character above U+FFFF takes two code units, called a surrogate pair. Understanding this prevents length bugs, broken slices, and garbled output with emoji and international text.
What you'll learn
- Explain surrogate pairs and why "💩".length === 2
- Iterate a string by code point using Array.from and the string iterator
- Apply Unicode normalization (NFC/NFD) before comparing or storing user text
Unicode assigns a code point (a number) to every character. The Basic Multilingual Plane (BMP) covers U+0000 to U+FFFF — these fit in one UTF-16 code unit. Characters above U+FFFF (supplementary planes) are encoded as a pair of code units called a surrogate pair: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF).
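The mapping between a supplementary code point and its two surrogates is plain arithmetic on a 20-bit offset. The sketch below shows one way to compute it by hand, using U+1F4A9 as the input. The helper names toSurrogatePair and fromSurrogatePair are illustrative, not built-ins; in real code, String.fromCodePoint and codePointAt do this work for you.

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 surrogate pair [high, low].
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("Expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// Hypothetical inverse: rebuild the code point from the two surrogates.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16));                // d83d dca9
console.log(fromSurrogatePair(high, low).toString(16));          // 1f4a9
console.log(String.fromCharCode(high, low) === "\u{1F4A9}");     // true
```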
Why "💩".length === 2
The pile-of-poo emoji is U+1F4A9 — above U+FFFF. JavaScript stores it as two
UTF-16 code units: \uD83D (high surrogate) and \uDCA9 (low surrogate).
The length property counts code units, so it returns 2.
const poo = "💩";
console.log(poo.length); // 2 — code units
console.log(poo.charCodeAt(0).toString(16)); // 'd83d' — high surrogate
console.log(poo.charCodeAt(1).toString(16)); // 'dca9' — low surrogate
console.log(poo.codePointAt(0).toString(16)); // '1f4a9' — actual code point
// Slicing at a surrogate boundary produces a broken character
console.log(poo.slice(0, 1)); // '\uD83D' (half a surrogate pair)
Iterating Code Points Correctly
The string iterator (used by for...of and spread) is surrogate-pair aware. It
yields each full character as a string, regardless of how many code units it uses.
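To make that behavior concrete, here is a rough sketch of a generator that walks a string the way the built-in iterator does, advancing by one or two code units per step. The name codePoints is hypothetical and exists only for illustration.

```javascript
// Hypothetical generator that mimics the built-in string iterator.
function* codePoints(str) {
  let i = 0;
  while (i < str.length) {
    // codePointAt combines a high + low surrogate into one code point;
    // for a lone surrogate it returns that single code unit's value.
    const cp = str.codePointAt(i);
    const ch = String.fromCodePoint(cp);
    yield ch;
    i += ch.length; // 1 for a BMP character, 2 for a surrogate pair
  }
}

console.log([...codePoints("A\u{1F4A9}Z")]); // [ 'A', '💩', 'Z' ]
```

In practice, for...of and spread do this work for you: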
const text = "A💩Z";
// Wrong: iterates code units — breaks surrogate pair
for (let i = 0; i < text.length; i++) {
  process.stdout.write(text[i] + " "); // A \uD83D \uDCA9 Z (two lone surrogates, usually shown as replacement characters)
}
// Correct: iterates code points
for (const ch of text) {
  process.stdout.write(ch + " "); // A 💩 Z
}
// Array.from uses the string iterator — gives correct length
console.log(Array.from(text).length); // 3
console.log([...text].length); // 3
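// Reverse a string without splitting surrogate pairs
// (grapheme clusters made of several code points, such as flags, can still be reordered)
const reversed = [...text].reverse().join("");
console.log(reversed); // "Z💩A"
Counting “Characters” (Grapheme Clusters)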
A user-perceived character (grapheme cluster) can consist of multiple code
points. For example, a base character plus a combining accent is two code points
but one visible character. The Intl.Segmenter API (part of ECMA-402, the ECMAScript Internationalization API) handles this:
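For the combining-accent case just described, a minimal sketch might look like the following. The helper countGraphemes and its default locale "en" are assumptions made for illustration, and the code assumes a runtime that ships Intl.Segmenter.

```javascript
const accented = "e\u0301"; // "e" + U+0301 COMBINING ACUTE ACCENT

console.log(accented.length);      // 2 code units
console.log([...accented].length); // 2 code points

// Hypothetical helper: count user-perceived characters.
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

console.log(countGraphemes(accented)); // 1
```

Flag emoji, which pair two regional indicator code points, behave the same way: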
const flag = "🇺🇸"; // Regional indicator U + S — two code points, one flag
console.log([...flag].length); // 2 — code points
const segmenter = new Intl.Segmenter();
const segments = [...segmenter.segment(flag)];
console.log(segments.length); // 1 (one grapheme cluster)
Normalization
The same visible text can have multiple Unicode representations. NFC uses precomposed characters (single code point for “é”). NFD decomposes into base character + combining accent (two code points). Always normalize before comparing, storing, or indexing text.
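One way to apply this rule when indexing is to normalize every key at the boundary, before it reaches a Map or a database. The sketch below assumes NFC as the canonical form; normalizeKey, recordVisit, and visits are hypothetical names used only for this example.

```javascript
// Hypothetical helper: canonical form used for every stored key.
function normalizeKey(text) {
  return text.normalize("NFC");
}

const visits = new Map();

// Hypothetical counter keyed by normalized text.
function recordVisit(name) {
  const key = normalizeKey(name);
  visits.set(key, (visits.get(key) ?? 0) + 1);
}

recordVisit("caf\u00E9");  // precomposed é (U+00E9)
recordVisit("cafe\u0301"); // e + combining acute accent (U+0301)

console.log(visits.size);             // 1
console.log(visits.get("caf\u00E9")); // 2
```

Without normalization the two spellings would land under separate keys. The underlying mismatch is easy to reproduce: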
const nfc = "\u00E9"; // é as a single precomposed code point (U+00E9)
const nfd = "e\u0301"; // e + combining acute accent (U+0301)
console.log(nfc.length); // 1
console.log(nfd.length); // 2
console.log(nfc === nfd); // false
// Normalize both to NFC before comparing
console.log(nfc.normalize("NFC") === nfd.normalize("NFC")); // true
Complexity Reference
| Operation | Complexity | Notes |
|---|---|---|
| Array.from(str) | O(n) | n = number of code points |
| [...str].reverse().join("") | O(n) | Code-point-safe reversal via the string iterator |
| str.normalize("NFC") | O(n) | n = code unit length |
| Intl.Segmenter iteration | O(n) | n = code unit length |
| str[i] / str.charCodeAt(i) | O(1) | Code unit access |
| str.codePointAt(i) | O(1) | i is a code unit index; reads 1 or 2 code units |
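The O(1) entries in the table index by code unit, which is easy to get wrong. The sketch below shows what codePointAt returns when the index lands in the middle of a pair, and one possible O(n) way to fetch the nth code point instead. The helper codePointAtPosition is hypothetical.

```javascript
const text = "A\u{1F4A9}Z"; // "A💩Z": 4 code units, 3 code points

console.log(text.codePointAt(1).toString(16)); // 1f4a9 (index at the high surrogate)
console.log(text.codePointAt(2).toString(16)); // dca9  (index at the low surrogate)
console.log(text.codePointAt(3).toString(16)); // 5a    ("Z")

// Hypothetical helper: return the code point at a code point position.
// Runs in O(n) because it has to walk the string from the start.
function codePointAtPosition(str, position) {
  let index = 0;
  for (const ch of str) {
    if (index === position) {
      return ch.codePointAt(0);
    }
    index++;
  }
  return undefined; // position is past the end
}

console.log(codePointAtPosition(text, 2).toString(16)); // 5a
```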
Up Next
With arrays and strings thoroughly covered, the next section introduces hashing — the data structure that powers O(1) lookups, deduplication, and frequency counting.