
Streaming Architecture

CSV files can be large: tens of thousands of rows, hundreds of megabytes. Loading an entire file into memory can exhaust the server's RAM and crash it. folksbase uses streaming throughout the CSV import and export pipelines to keep memory usage constant regardless of file size.

Why Streaming?

The naive approach to handling file uploads is simple: read the entire file into a buffer, parse it, process it. This works for small files but fails catastrophically for large ones:
  • A 200 MB CSV file would consume 200 MB+ of server memory just to hold the raw content
  • Parsing it into objects doubles or triples that memory usage
  • Multiple concurrent uploads could exhaust the server’s memory entirely
Streaming solves this by processing data in small chunks as it arrives. At any point, only a small portion of the file is in memory.
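The contrast can be sketched in a few lines of Node's stream API (a minimal illustration, not folksbase code):

```typescript
import { createReadStream } from "node:fs";

// Buffered: readFile(path) pulls the whole file into one buffer,
// so a 200 MB file costs 200 MB of memory before parsing even starts.

// Streaming: only one chunk (~64 KB by default) is in memory at a time.
async function countBytesStreaming(path: string): Promise<number> {
  let total = 0;
  for await (const chunk of createReadStream(path)) {
    total += (chunk as Buffer).byteLength; // only this chunk is resident
  }
  return total;
}
```

Memory usage of the streaming version is bounded by the chunk size, not the file size.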

CSV Import Flow

The import pipeline has two phases: upload (synchronous, in the route handler) and processing (asynchronous, in an Inngest background job).

Phase 1: Upload (Route Handler)

When a CSV file is uploaded, the route handler does several things simultaneously without buffering the entire file:
  1. Stream splitting with tee() — The file’s ReadableStream is split into two branches using .tee(). One branch streams to Vercel Blob for storage, the other counts rows via a streaming CSV parser.
  2. First 64KB preview — A custom teeWithHead() function reads the first 64KB from the upload stream without consuming the rest. This small chunk is enough to extract headers and 5 preview rows for the column mapping UI.
  3. Byte-level size enforcement — A limitStream() transform wraps the upload stream with a byte counter. If the stream exceeds 200 MB (even if the Content-Length header was spoofed), the upload is aborted mid-stream.
  4. Binary content detection — The first 8KB is scanned for null bytes to detect binary files masquerading as CSVs.
At no point is the full file held in memory. The preview uses 64KB, the row counter processes one record at a time, and the blob upload streams directly from the client to storage.
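The byte-level limiter from step 3 can be sketched as a TransformStream. The name limitStream matches the text above, but the body here is an assumption, not the actual implementation:

```typescript
// Simplified sketch of a byte-counting limiter; the real limitStream may differ.
function limitStream(maxBytes: number): TransformStream<Uint8Array, Uint8Array> {
  let seen = 0;
  return new TransformStream({
    transform(chunk, controller) {
      seen += chunk.byteLength;
      if (seen > maxBytes) {
        // Aborts mid-stream based on actual bytes, even if Content-Length lied.
        controller.error(new Error(`upload exceeded ${maxBytes} bytes`));
      } else {
        controller.enqueue(chunk);
      }
    },
  });
}
```

A handler would pipe the request body through it, e.g. `body.pipeThrough(limitStream(200 * 1024 * 1024))`, and the resulting stream errors as soon as the limit is crossed.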

Phase 2: Processing (Inngest Job)

Once the user confirms the column mapping, an Inngest background job processes the file:
  1. download-and-chunk — Streams the CSV from Vercel Blob through a csv-parse parser. As rows are parsed, they accumulate in a buffer. Every 500 rows, the buffer is flushed to Redis as a JSON chunk with a 1-hour TTL. Memory usage stays constant at ~500 rows regardless of file size.
  2. process-batch-N (one step per chunk) — Each chunk is processed independently as a separate Inngest step. This means:
    • If batch 3 fails, only batch 3 retries — not the entire import
    • Each batch reads its chunk from Redis, upserts contacts in 50-row sub-batches, and deletes the chunk when done
    • Progress is reported after each sub-batch for real-time UI updates
  3. finalize — Marks the import as completed, invalidates caches, generates an AI summary, and sends notification emails.
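The buffer-and-flush behavior of the download-and-chunk step can be sketched as follows. The function name and the flush callback are illustrative; the real step streams from csv-parse and writes each chunk to Redis with a 1-hour TTL:

```typescript
const CHUNK_SIZE = 500;

// Accumulate parsed rows and flush every CHUNK_SIZE rows. In the real job,
// flush would write the JSON-serialized chunk to Redis with a 3600 s expiry.
async function chunkRows<T>(
  rows: AsyncIterable<T>,
  flush: (chunk: T[], index: number) => Promise<void>,
): Promise<number> {
  let buffer: T[] = [];
  let index = 0;
  for await (const row of rows) {
    buffer.push(row);
    if (buffer.length === CHUNK_SIZE) {
      await flush(buffer, index++);
      buffer = []; // memory stays bounded at CHUNK_SIZE rows
    }
  }
  if (buffer.length > 0) await flush(buffer, index++); // trailing partial chunk
  return index; // chunk count = number of process-batch-N steps to schedule
}
```

A 1,050-row file would produce three chunks (500, 500, 50), and therefore three independent processing steps.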

Why Chunked Steps?

The previous approach used a single monolithic step that streamed, parsed, and inserted all rows. If it failed at row 9,000 of 10,000, the entire import would retry from the beginning. The chunked approach isolates failures. Each 500-row chunk is an independent Inngest step with its own retry budget. A transient database error on one chunk doesn’t affect the others.

CSV Export Flow

Exports use a similar streaming approach, but in reverse — reading from the database and writing to blob storage.

How It Works

  1. Metadata resolution — A single step discovers all custom field keys (via SELECT DISTINCT jsonb_object_keys) and counts the total rows. This avoids a two-pass scan over the data.
  2. Streaming CSV generation — The createCsvStream() function creates a ReadableStream that:
    • Fetches contacts in cursor-based batches of 1,000
    • Pipes each batch through csv-stringify’s streaming API
    • Never holds more than one batch in memory
  3. Direct blob upload — The CSV stream is piped directly to Vercel Blob. There’s no intermediate file or buffer.
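A pull-based version of this stream can be sketched as below. The real createCsvStream escapes fields through csv-stringify; here rows are joined naively for brevity, and fetchBatch stands in for the cursor query:

```typescript
type Batch = { rows: string[][]; nextCursor: string | null };

// Sketch of a pull-based CSV stream: each pull() fetches at most one batch,
// so no more than one batch of rows is ever in memory.
function createCsvStream(
  header: string[],
  fetchBatch: (cursor: string | null) => Promise<Batch>,
): ReadableStream<string> {
  let cursor: string | null = null;
  let sentHeader = false;
  return new ReadableStream<string>({
    async pull(controller) {
      if (!sentHeader) {
        controller.enqueue(header.join(",") + "\n"); // real code escapes fields
        sentHeader = true;
        return;
      }
      const { rows, nextCursor } = await fetchBatch(cursor);
      for (const row of rows) controller.enqueue(row.join(",") + "\n");
      cursor = nextCursor;
      if (nextCursor === null) controller.close();
    },
  });
}
```

Because the stream only pulls when the consumer (the blob upload) is ready for more, backpressure naturally throttles the database reads.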

Why Cursor-Based Batching?

OFFSET-based pagination (LIMIT 1000 OFFSET 5000) gets slower as the offset grows — Postgres has to scan and discard all the skipped rows. Cursor-based pagination (WHERE id > last_seen_id ORDER BY id LIMIT 1000) is consistently fast because it uses the primary key index. For an export of 100,000 contacts, cursor-based batching keeps every batch equally fast. OFFSET-based would make the last batches significantly slower than the first.
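In TypeScript, the cursor loop looks roughly like this (the `db.query` shape and the `contacts` table are assumptions for illustration):

```typescript
type Contact = { id: number; [key: string]: unknown };
type Db = { query: (sql: string, params: unknown[]) => Promise<Contact[]> };

// Keyset pagination: each batch seeks past the last seen id via the PK index,
// so batch N costs the same as batch 1, unlike OFFSET, which scans and
// discards every skipped row.
async function* contactBatches(db: Db, batchSize = 1000): AsyncGenerator<Contact[]> {
  let lastId = 0;
  while (true) {
    const rows = await db.query(
      "SELECT * FROM contacts WHERE id > $1 ORDER BY id LIMIT $2",
      [lastId, batchSize],
    );
    if (rows.length === 0) return; // no rows left: export complete
    yield rows;
    lastId = rows[rows.length - 1].id; // advance the cursor
  }
}
```

One caveat of keyset pagination is that it requires a unique, totally ordered column (here the primary key) to act as the cursor.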

Key Design Decisions

Why Vercel Blob for storage?

Vercel Blob supports streaming uploads and downloads natively. The upload stream from the client can be piped directly to blob storage without buffering. For downloads, the blob stream can be piped directly to the HTTP response.

Why Redis for chunk storage?

Inngest steps are isolated — they can’t share in-memory state. Redis provides fast, temporary storage for the parsed CSV chunks between the download step and the processing steps. The 1-hour TTL ensures chunks are cleaned up even if the job fails.

Why 500-row chunks and 50-row sub-batches?

  • 500 rows per chunk balances Redis storage (each chunk is a few hundred KB of JSON) against the number of Inngest steps (a 10,000-row import creates 20 steps, not 10,000).
  • 50 rows per sub-batch balances database insert performance against progress reporting granularity. Users see the progress bar update every 50 rows.

What’s Next?