Strings Deep Dive

JS Strings Are Immutable UTF-16 Sequences — Code Units Are Not Code Points

JavaScript strings are immutable sequences of UTF-16 code units, not bytes or characters — understanding the difference prevents subtle bugs when building, slicing, and comparing strings.

5 min read · Level 2/5 · #dsa #strings #utf-16

What you'll learn
  • Distinguish code units from code points and explain why they differ for emoji and some CJK extension characters
  • Choose between substring, slice, and the deprecated substr correctly
  • Build strings efficiently using Array.join instead of repeated concatenation

A JavaScript string is an immutable, ordered sequence of UTF-16 code units. “Immutable” means every operation that appears to modify a string actually returns a new string — the original is untouched. “UTF-16 code unit” is not the same as a character: some characters occupy two code units (a surrogate pair).
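
To make the immutability point concrete, here is a minimal sketch. The variable names are illustrative only; the methods used are standard String methods that return new strings.

```javascript
// Each method call returns a NEW string; `original` is never modified.
const original = "hello";
const upper = original.toUpperCase();        // new string
const replaced = original.replace("h", "j"); // new string

console.log(original); // "hello" (unchanged)
console.log(upper);    // "HELLO"
console.log(replaced); // "jello"
```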

Code Units vs. Code Points

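The length property and bracket-index access work on code units, not characters. Every character in the Basic Multilingual Plane (U+0000 to U+FFFF), which covers ASCII, Latin, and the most common CJK characters, fits in one code unit. Characters above U+FFFF (emoji, many CJK extension characters) require two code units. Separately, one visible character can span several code points, as when a base letter is followed by a combining accent.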

const s = "cafe\u0301"; // 'cafe' + U+0301 combining acute accent, renders as 'café'
console.log(s.length);  // 5 (code units), not 4 (visible characters)

const emoji = "💩";
console.log(emoji.length);      // 2  — two UTF-16 code units
console.log(emoji.codePointAt(0).toString(16)); // '1f4a9'
console.log(emoji[0]);          // '\uD83D' — high surrogate (garbled)
console.log([...emoji].length); // 1  — spread iterates code points
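
The three ways of counting above can be wrapped in small helpers. This is a sketch, not a standard API: countCodePoints and countGraphemes are hypothetical names, and countGraphemes assumes the runtime provides Intl.Segmenter (recent Node.js and browsers do).

```javascript
// Hypothetical helper: counts Unicode code points.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) count++; // for...of iterates by code point
  return count;
}

// Hypothetical helper: counts user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _ of segmenter.segment(str)) count++;
  return count;
}

const accented = "cafe\u0301"; // 'e' followed by a combining acute accent
console.log(accented.length);              // 5 code units
console.log(countCodePoints(accented));    // 5 code points
console.log(countGraphemes(accented));     // 4 visible characters
console.log(countCodePoints("\u{1F4A9}")); // 1 code point (2 code units)
```

Which count is the right one depends on the task: length for indexing into the string, code points for iterating safely past surrogate pairs, graphemes for anything shown to a user.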

Immutability and String Building

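Because strings are immutable, s += part inside a loop conceptually creates a new string on every iteration, which is O(n²) total copying in the worst case. Many modern engines soften this with internal rope representations, but that is an implementation detail rather than a guarantee. Collecting parts in an array and joining once at the end keeps the work linear regardless.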

// Slow for large n: O(n²) due to repeated copying
function buildBad(n) {
  let s = "";
  for (let i = 0; i < n; i++) s += "x";
  return s;
}

// Fast: O(n), because join builds the result string in a single pass
function buildGood(n) {
  return Array(n).fill("x").join("");
}

// Or collect parts as they arrive (parts can be any iterable), then join once
function buildParts(parts) {
  const buf = [];
  for (const p of parts) buf.push(p);
  return buf.join("");
}
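
When parts are produced in several places, the collect-then-join pattern can be wrapped in a small builder. The StringBuilder class below is a hypothetical helper sketched for illustration; JavaScript has no built-in class of that name.

```javascript
// Hypothetical helper: buffers parts in an array and joins them once on demand.
class StringBuilder {
  constructor() {
    this.parts = [];
  }

  append(part) {
    this.parts.push(String(part)); // coerce numbers etc. to strings
    return this;                   // allow chaining
  }

  toString() {
    return this.parts.join("");    // single join at the end
  }
}

const sb = new StringBuilder();
for (let i = 0; i < 3; i++) sb.append(i).append(",");
console.log(sb.toString()); // "0,1,2,"
```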

substring vs. slice vs. substr

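| Method | Second argument | Handles negative indices | Notes |
| --- | --- | --- | --- |
| slice(start, end) | end index, exclusive | Yes, counts from the end | Preferred |
| substring(start, end) | end index, exclusive | No, negatives are treated as 0; swaps arguments if start > end | Quirky edge cases |
| substr(start, length) | length, not an end index | start can be negative | Deprecated, avoid |
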
const s = "Hello, World!";
console.log(s.slice(7, 12));       // "World"
console.log(s.slice(-6, -1));      // "World"
console.log(s.substring(7, 12));   // "World"
console.log(s.substring(12, 7));   // "World" (args swapped automatically)
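
The edge cases are where slice and substring diverge. A short sketch using the same sample string (the variable name text is arbitrary):

```javascript
const text = "Hello, World!";

// Negative start: slice counts from the end, substring clamps to 0.
console.log(text.slice(-6));        // "World!"
console.log(text.substring(-6));    // "Hello, World!"

// start > end: slice returns an empty string, substring swaps the arguments.
console.log(text.slice(12, 7));     // ""
console.log(text.substring(12, 7)); // "World"

// Out-of-range end: both clamp to the string length.
console.log(text.slice(7, 100));    // "World!"
```

Because slice never reorders its arguments, a bad index pair surfaces as an empty result instead of a plausible-looking substring.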

Normalization

Two strings that look identical can fail strict equality if they use different Unicode normalization forms. Use normalize() before comparing user-facing text.

const a = "caf\u00E9";   // NFC: precomposed é (U+00E9)
const b = "cafe\u0301";  // NFD: e + combining acute accent (U+0301)

console.log(a === b);          // false — different code units
console.log(a.normalize("NFC") === b.normalize("NFC")); // true
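
The comparison can be packaged as a helper so callers cannot forget to normalize one side. equalsNormalized is a hypothetical name; defaulting to NFC is an assumption, and any form works as long as both sides use the same one.

```javascript
// Hypothetical helper: compares two strings after normalizing both to one form.
function equalsNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

const composed = "caf\u00E9";    // precomposed é
const decomposed = "cafe\u0301"; // e + combining acute accent

console.log(composed === decomposed);                // false
console.log(equalsNormalized(composed, decomposed)); // true
console.log(composed.length, decomposed.length);     // 4 5
console.log(decomposed.normalize("NFC").length);     // 4
```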

Up Next

String searching — from the naive O(nm) approach to KMP and Rabin-Karp — is the practical application of everything covered here.
