JS Strings Are Immutable UTF-16 Sequences — Code Units Are Not Code Points
Strings Deep Dive
JavaScript strings are immutable sequences of UTF-16 code units, not bytes or characters — understanding the difference prevents subtle bugs when building, slicing, and comparing strings.
What you'll learn
- Distinguish code units from code points and explain why they differ for emoji and some CJK extension characters
- Choose between substring, slice, and the deprecated substr correctly
- Build strings efficiently using Array.join instead of repeated concatenation
A JavaScript string is an immutable, ordered sequence of UTF-16 code units. “Immutable” means every operation that appears to modify a string actually returns a new string — the original is untouched. “UTF-16 code unit” is not the same as a character: some characters occupy two code units (a surrogate pair).
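The immutability described above can be observed directly. The following is a minimal sketch; `tryToMutate` is a hypothetical helper name introduced only for this illustration, and it relies on strict mode so that the rejected write is reported instead of silently ignored.

```javascript
// Hypothetical helper: attempts an in-place write and reports whether it was rejected.
function tryToMutate(str) {
  "use strict"; // strict mode turns the failed write into a TypeError
  try {
    str[0] = "J";
    return false; // not reached in strict mode
  } catch (err) {
    return err instanceof TypeError;
  }
}

const original = "hello";
const shouted = original.toUpperCase(); // returns a new string

console.log(original);              // "hello"
console.log(shouted);               // "HELLO"
console.log(tryToMutate(original)); // true
```

In non-strict code the same assignment fails silently, which is one reason the behavior is easy to overlook.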
Code Units vs. Code Points
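The length property and bracket-index access work on code units, not
characters. Every character in the Basic Multilingual Plane (U+0000 to U+FFFF),
which covers ASCII, Latin, and most CJK characters, fits in one code unit.
Characters above U+FFFF (emoji, many CJK extension characters) require two
code units, known as a surrogate pair.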
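One practical consequence is that slicing by code-unit index can cut a surrogate pair in half. Below is a minimal sketch of two helpers that work on code points instead; the names `codePointLength` and `truncateCodePoints` are hypothetical and not part of any standard API.

```javascript
// Count code points rather than UTF-16 code units.
function codePointLength(str) {
  return [...str].length; // string iteration yields one entry per code point
}

// Keep at most `max` code points, never splitting a surrogate pair.
function truncateCodePoints(str, max) {
  return [...str].slice(0, max).join("");
}

const sample = "a\u{1F4A9}b"; // "a", U+1F4A9 (two code units), "b"

console.log(sample.length);                        // 4
console.log(codePointLength(sample));              // 3
console.log(sample.slice(0, 2).length);            // 2, ends in a lone high surrogate
console.log(truncateCodePoints(sample, 2).length); // 3, "a" plus the full pair
```

Code points are still not the same as visible characters: a base letter followed by a combining mark counts as two code points here.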
const s = "cafe\u0301"; // "cafe" + U+0301 combining acute accent, renders as "café"
console.log(s.length); // 5 (code units), not 4 (visible characters)
const emoji = "💩";
console.log(emoji.length); // 2: two UTF-16 code units
console.log(emoji.codePointAt(0).toString(16)); // '1f4a9'
console.log(emoji[0]); // '\uD83D': a lone high surrogate, not a printable character
console.log([...emoji].length); // 1: spread iterates code points
Immutability and String Building
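Because strings are immutable, concatenating n strings with += in a loop
can create a new string on each iteration, which is O(n²) total copying in the
worst case. Many modern engines soften this with internal rope representations,
so the real cost varies. Collecting parts in an array and joining once at the
end keeps the work linear regardless.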
// Potentially slow for large n: up to O(n²) due to repeated copying
function buildBad(n) {
  let s = "";
  for (let i = 0; i < n; i++) s += "x";
  return s;
}
// Fast: O(n), join concatenates all parts in a single pass
function buildGood(n) {
  return Array(n).fill("x").join("");
}
// Or with a pre-collected array of parts
function buildParts(parts) {
  const buf = [];
  for (const p of parts) buf.push(p);
  return buf.join("");
}
substring vs. slice vs. substr
| Method | Second argument | Negative indices | Notes |
|---|---|---|---|
| slice(start, end) | End index, exclusive | Yes, counted from the end | Preferred |
| substring(start, end) | End index, exclusive | No, treated as 0 | Swaps the arguments if start > end |
| substr(start, length) | Length, not an end index | Start only, counted from the end | Deprecated, avoid |
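The edge cases summarized in the table can be checked directly. This is a small sketch using an arbitrary sample string; the variable name `text` is chosen only for this illustration.

```javascript
const text = "Hello, World!"; // 13 code units

// slice: negative indices count from the end.
console.log(text.slice(-6));        // "World!"
// slice: a start past the end yields an empty string (no swapping).
console.log(text.slice(12, 7));     // ""

// substring: negative arguments are treated as 0.
console.log(text.substring(-6));    // "Hello, World!"
console.log(text.substring(5, -1)); // "Hello" (behaves like substring(0, 5))

// substr: the second argument is a length; shown only so it can be
// recognized in legacy code.
console.log(text.substr(-6, 5));    // "World"
```

The silent clamping and swapping in substring is the main reason slice tends to be easier to reason about.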
const s = "Hello, World!";
console.log(s.slice(7, 12)); // "World"
console.log(s.slice(-6, -1)); // "World"
console.log(s.substring(7, 12)); // "World"
console.log(s.substring(12, 7)); // "World" (args swapped automatically)
Normalization
Two strings that look identical can fail strict equality if they use different
Unicode normalization forms. Use normalize() before comparing user-facing text.
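The same issue appears whenever strings act as keys, for example when removing duplicates from user input. The sketch below assumes NFC is the desired canonical form; `dedupeNormalized` is a hypothetical helper, not a built-in.

```javascript
// Hypothetical helper: drop entries that are duplicates after NFC normalization.
function dedupeNormalized(items) {
  const seen = new Set();
  const result = [];
  for (const item of items) {
    const key = item.normalize("NFC");
    if (!seen.has(key)) {
      seen.add(key);
      result.push(key);
    }
  }
  return result;
}

// Escapes make the two encodings of "café" explicit.
const tags = ["caf\u00E9", "cafe\u0301", "tea"];

console.log(new Set(tags).size);            // 3, the two spellings differ
console.log(dedupeNormalized(tags).length); // 2
```

Normalizing once at the point where text enters the program is usually simpler than normalizing at every comparison.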
const a = "caf\u00E9"; // NFC: precomposed é (U+00E9)
const b = "cafe\u0301"; // NFD: e + combining acute accent (U+0301)
console.log(a === b); // false: different code units
console.log(a.normalize("NFC") === b.normalize("NFC")); // true
Up Next
String searching — from the naive O(nm) approach to KMP and Rabin-Karp — is the practical application of everything covered here.
String Searching →