Strings Deep Dive

JavaScript strings are immutable sequences of UTF-16 code units, not bytes or characters — understanding the difference prevents subtle bugs when building, slicing, and comparing strings.

What you'll learn

Distinguish code units from code points and explain why they differ for emoji and some CJK extension characters
Choose between substring, slice, and the deprecated substr correctly
Build strings efficiently using Array.join instead of repeated concatenation

A JavaScript string is an immutable, ordered sequence of UTF-16 code units. “Immutable” means every operation that appears to modify a string actually returns a new string — the original is untouched. “UTF-16 code unit” is not the same as a character: some characters occupy two code units (a surrogate pair).
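A minimal sketch of what immutability looks like in practice. The variable names are illustrative only; each built-in method call returns a new string and leaves the original as it was.

```javascript
// Each method returns a new string; `original` is never changed.
const original = "hello";
const upper = original.toUpperCase();
const replaced = original.replace("h", "j");
const padded = original.padEnd(8, "!");

console.log(original); // "hello"
console.log(upper);    // "HELLO"
console.log(replaced); // "jello"
console.log(padded);   // "hello!!!"
```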

Code Units vs. Code Points

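The length property and bracket-index access work on code units, not characters. Every character in the Basic Multilingual Plane (U+0000 to U+FFFF), which covers ASCII, Latin, and the most common CJK characters, fits in one code unit. Characters above U+FFFF (emoji, many CJK extension characters) require two code units. Separately, one visible character can span several code points, for example a base letter followed by a combining accent.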

const s = "cafe\u0301"; // 'cafe' + combining acute accent (U+0301), renders as 'café'
console.log(s.length);  // 5 (code units), not 4 (visible characters)

const emoji = "💩";
console.log(emoji.length);      // 2  — two UTF-16 code units
console.log(emoji.codePointAt(0).toString(16)); // '1f4a9'
console.log(emoji[0]);          // '\uD83D' — high surrogate (garbled)
console.log([...emoji].length); // 1  — spread iterates code points
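Building on the spread example, here is a rough sketch of two code-point-aware helpers. The names codePointLength and truncateCodePoints are hypothetical, not built-ins. They rely on string iteration yielding whole code points, so they avoid splitting a surrogate pair, but they do not group combining marks with their base letter.

```javascript
// Hypothetical helper: count code points rather than UTF-16 code units.
function codePointLength(str) {
  let count = 0;
  for (const _ of str) count++; // string iteration yields whole code points
  return count;
}

// Hypothetical helper: keep at most `max` code points,
// never cutting a surrogate pair in half.
function truncateCodePoints(str, max) {
  return Array.from(str).slice(0, max).join("");
}

const text = "hi💩!";
console.log(text.length);                 // 5 code units
console.log(codePointLength(text));       // 4 code points
console.log(truncateCodePoints(text, 3)); // "hi💩"
console.log(text.slice(0, 3).length);     // 3, ends with a lone high surrogate
```

Plain slice(0, 3) cuts by code units, so it leaves half of the emoji behind; the helper cuts by code points instead.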

Immutability and String Building

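Because strings are immutable, each += in a loop conceptually creates a new string. If the engine copies the accumulated string on every iteration, building a string of length n costs O(n²) total work in the worst case. Modern engines often soften this with internal rope representations, so measure before optimizing. When in doubt, collect parts in an array and join once at the end.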

// Potentially slow for large n: O(n²) if each += copies the accumulated string
function buildBad(n) {
  let s = "";
  for (let i = 0; i < n; i++) s += "x";
  return s;
}

// Linear: join builds the result in a single pass over the parts
function buildGood(n) {
  return Array(n).fill("x").join("");
}

// Or with a pre-collected array of parts
function buildParts(parts) {
  const buf = [];
  for (const p of parts) buf.push(p);
  return buf.join("");
}
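Since the real cost depends on the engine, it can help to measure. The following is a rough timing sketch, not a rigorous benchmark: timeMs is a hypothetical helper, Date.now() is coarse, and the numbers will vary between runs and runtimes.

```javascript
function buildWithConcat(n) {
  let s = "";
  for (let i = 0; i < n; i++) s += "x";
  return s;
}

function buildWithJoin(n) {
  const parts = [];
  for (let i = 0; i < n; i++) parts.push("x");
  return parts.join("");
}

// Hypothetical helper: elapsed wall-clock milliseconds for one call.
function timeMs(fn) {
  const start = Date.now();
  fn();
  return Date.now() - start;
}

const n = 200000; // arbitrary size chosen for illustration
console.log("concat:", timeMs(() => buildWithConcat(n)), "ms");
console.log("join:  ", timeMs(() => buildWithJoin(n)), "ms");

// Both strategies produce the same string.
console.log(buildWithConcat(1000) === buildWithJoin(1000)); // true
```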

substring vs. slice vs. substr

Method	Second argument	Negative indices	Notes
slice(start, end)	end index, exclusive	Yes, counted from the end	Preferred
substring(start, end)	end index, exclusive	No, treated as 0	Swaps arguments if start > end
substr(start, length)	length, not an end index	Start only, counted from the end	Deprecated, avoid

const s = "Hello, World!";
console.log(s.slice(7, 12));       // "World"
console.log(s.slice(-6, -1));      // "World"
console.log(s.substring(7, 12));   // "World"
console.log(s.substring(12, 7));   // "World" (args swapped automatically)
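The edge cases are where the two methods diverge. A short sketch with an illustrative variable name, greeting, showing negative and out-of-order arguments side by side:

```javascript
const greeting = "Hello, World!";

// Negative start: slice counts from the end, substring clamps to 0.
console.log(greeting.slice(-3));       // "ld!"
console.log(greeting.substring(-3));   // "Hello, World!"

// start > end: slice returns an empty string, substring swaps the arguments.
console.log(greeting.slice(5, 2));     // ""
console.log(greeting.substring(5, 2)); // "llo"

// An omitted or out-of-range end stops at the end of the string.
console.log(greeting.slice(7));        // "World!"
console.log(greeting.slice(7, 100));   // "World!"
```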

Normalization

Two strings that look identical can fail strict equality if they use different Unicode normalization forms. Use normalize() before comparing user-facing text.

const a = "caf\u00E9";   // NFC: precomposed é (U+00E9)
const b = "cafe\u0301";  // NFD: e + combining acute accent (U+0301)

console.log(a === b);          // false — different code units
console.log(a.normalize("NFC") === b.normalize("NFC")); // true
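One way to package this is a small comparison helper. This is a minimal sketch: sameText is a hypothetical name, and defaulting to NFC is an assumption rather than a requirement. The strings use explicit escapes so the two forms stay distinct regardless of how an editor saves the file.

```javascript
// Hypothetical helper: compare two strings after normalizing both
// to the same Unicode normalization form (NFC assumed as the default).
function sameText(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

const composed = "caf\u00E9";    // precomposed é
const decomposed = "cafe\u0301"; // e + combining acute accent

console.log(composed === decomposed);            // false
console.log(sameText(composed, decomposed));     // true
console.log(composed.normalize("NFD").length);   // 5
console.log(decomposed.normalize("NFC").length); // 4
```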

Up Next

String searching — from the naive O(nm) approach to KMP and Rabin-Karp — is the practical application of everything covered here.
