Streams for Large Data Processing

Process Gigabyte Files Line-by-Line Without Loading Them into Memory

Streams for Large Data Processing

Node.js Readable and Transform streams let you pipe data through an algorithm one chunk at a time, keeping memory flat even on multi-gigabyte inputs — including backpressure to avoid overwhelming downstream consumers.

5 min read Level 3/5 #dsa#nodejs#interview
What you'll learn
  • Process a large file line-by-line using readline and a Readable stream
  • Build a Transform stream that filters or maps chunks on the fly
  • Explain backpressure and how pipe handles it automatically

Loading a 4 GB log file with fs.readFileSync allocates 4 GB of heap — a quick way to crash a Node process. Streams let you handle the same file with a fixed memory footprint, because you never hold more than one chunk at a time.

Reading a File Line-by-Line

readline.createInterface wraps any Readable and emits one line event per newline. It is the idiomatic way to process CSVs, logs, or NDJSON files.

import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

async function countWords(filePath) {
  const stream = createReadStream(filePath, { encoding: 'utf8' });
  const rl = createInterface({ input: stream, crlfDelay: Infinity });

  let wordCount = 0;
  for await (const line of rl) {
    wordCount += line.split(/\s+/).filter(Boolean).length;
  }
  return wordCount;
}

const total = await countWords('./war-and-peace.txt');
console.log('Total words:', total);

The for await...of loop on the readline interface keeps backpressure intact — it only reads the next line when your code is ready for it.

Transform Streams for On-the-Fly Mapping

A Transform stream sits in the middle of a pipeline, receiving chunks and emitting transformed ones. This is useful when you want to filter rows, parse JSON, or apply an algorithm to each record without buffering everything.

import { createReadStream, createWriteStream } from 'node:fs';
import { Transform } from 'node:stream';
import { pipeline } from 'node:stream/promises';

// Keep only lines where the third CSV column exceeds a threshold
class FilterHighValue extends Transform {
  constructor(threshold) {
    super({ readableObjectMode: false, writableObjectMode: false });
    this._threshold = threshold;
    this._partial = '';
  }

  _transform(chunk, _encoding, callback) {
    const lines = (this._partial + chunk.toString()).split('\n');
    this._partial = lines.pop(); // save incomplete trailing line
    for (const line of lines) {
      const cols = line.split(',');
      if (Number(cols[2]) > this._threshold) {
        this.push(line + '\n');
      }
    }
    callback();
  }

  _flush(callback) {
    if (this._partial) this.push(this._partial + '\n');
    callback();
  }
}

await pipeline(
  createReadStream('./data.csv'),
  new FilterHighValue(1000),
  createWriteStream('./filtered.csv'),
);

Backpressure

Backpressure is the mechanism that prevents a fast producer from overwhelming a slow consumer. When the writable side’s internal buffer is full, write() returns false — the readable should pause until the drain event fires.

stream.pipeline (and its promise variant from node:stream/promises) wires backpressure automatically. Avoid manually piping and forgetting to handle drain — that is the most common stream memory leak.

APIBackpressureError forwarding
readable.pipe(writable)YesPartial — must add error handlers
stream.pipeline(...)YesYes — one callback for all errors
stream/promises pipelineYesYes — rejects on any stream error

Up Next

Once your algorithm runs without crashing the process, you need to know where it spends its time. Profiling gives you that picture.

Profiling Node.js Algorithms →