Why CSV Import?

Most contact management starts with a spreadsheet. Rather than asking users to manually re-enter data, folksbase lets you drag-and-drop a CSV file and handles the rest — including figuring out which columns map to which fields. The import pipeline is designed around two constraints: files can be large (up to 200 MB), and the mapping step should feel effortless. That’s why the system streams the file to storage immediately, only reads the first 64 KB for preview, and uses AI to suggest column mappings before processing begins.

The Import Pipeline

The import flow has four distinct phases. Each phase is isolated so failures in one don’t corrupt the others.

Phase 1: Upload & Validation

When a user uploads a CSV, the API performs several checks before accepting the file:
  • MIME type validation — only text/csv, application/vnd.ms-excel, text/plain, and application/octet-stream are allowed
  • Binary content detection — the first 8 KB is scanned for null bytes to reject non-text files
  • Filename sanitization — path traversal characters and special characters are stripped
  • Size enforcement — a streaming byte counter aborts the upload if it exceeds 200 MB, even if the Content-Length header was spoofed
  • Rate limiting — uploads are limited to 5 per 10 minutes per user (on top of the global 100/60s limiter)
The file is streamed directly to Vercel Blob using ReadableStream.tee() — one branch goes to storage, the other feeds the preview parser. The full file never sits in memory.
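Two of these checks are simple enough to sketch in isolation. This is illustrative code, not the actual route handler — names like `looksBinary` and `makeByteCounter` are assumptions:

```typescript
const MAX_BYTES = 200 * 1024 * 1024; // 200 MB hard limit
const SNIFF_BYTES = 8 * 1024;        // first 8 KB scanned for null bytes

// Reject files whose first 8 KB contains a null byte — a cheap way to
// catch binary files mislabeled as text/csv.
function looksBinary(firstChunk: Uint8Array): boolean {
  const limit = Math.min(firstChunk.length, SNIFF_BYTES);
  for (let i = 0; i < limit; i++) {
    if (firstChunk[i] === 0) return true;
  }
  return false;
}

// Streaming byte counter: trust the bytes actually seen, not the
// Content-Length header, which a client can spoof.
function makeByteCounter() {
  let total = 0;
  return (chunk: Uint8Array): number => {
    total += chunk.length;
    if (total > MAX_BYTES) {
      throw new Error(`Upload exceeds ${MAX_BYTES} bytes`);
    }
    return total;
  };
}
```

Because the counter runs per chunk inside the stream, the upload can be aborted mid-flight rather than after the whole file arrives.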

Phase 2: Preview & AI Column Mapping

Only the first 64 KB of the file is read into memory. From that chunk, the system extracts:
  • Headers — the CSV column names
  • Sample rows — a few rows of data for the user to review
  • Total row count — computed by a streaming parser on a separate tee’d branch
The headers and sample rows are sent to Claude Haiku for column mapping suggestions. The AI returns a JSON object mapping each CSV header to a contact field (email, first_name, last_name, phone, company, notes) or null if it can’t determine a match. Each suggestion includes a confidence score (high or low) based on Levenshtein distance between the header name and the field name. See the AI Integration guide for details on how this works and what happens when the AI is unavailable. The user sees the preview table with suggested mappings and can adjust them before confirming.
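The distance-based confidence check can be sketched roughly as follows. The type names and the `confidenceFor` heuristic here are illustrative assumptions, not the actual implementation:

```typescript
type ContactField = "email" | "first_name" | "last_name" | "phone" | "company" | "notes";

// Hypothetical shape of one suggestion in the AI's response.
interface MappingSuggestion {
  field: ContactField | null;
  confidence: "high" | "low";
}

// Standard dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) => [
    i,
    ...Array(b.length).fill(0),
  ]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// A header like "E-mail Address" normalizes to something containing
// "email", so the suggestion is marked high confidence; an opaque
// header like "col_7" is not.
function confidenceFor(header: string, field: ContactField): "high" | "low" {
  const normalized = header.toLowerCase().replace(/[^a-z]/g, "");
  const target = field.replace(/_/g, "");
  if (normalized.includes(target) || target.includes(normalized)) return "high";
  return levenshtein(normalized, target) <= 2 ? "high" : "low";
}
```

The key design point is that the AI proposes the mapping, but the confidence label comes from a deterministic string-distance check — so a hallucinated mapping to an unrelated header surfaces as low confidence for the user to review.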

Phase 2.5: Review Step — Large File Handling

After the user confirms column mappings, a review step lets them inspect the mapped data with validation feedback (e.g. invalid emails highlighted) before kicking off processing. The challenge: this needs to work for files with 500K+ rows without crashing the browser tab.

The problem with naive rendering

Parsing an entire CSV into a string[][] and holding it in React state doesn’t scale:
| Rows | Naive approach (parsed arrays) | Optimized approach |
| ---- | ------------------------------ | ------------------ |
| 10K  | ~4 MB                          | ~1 MB              |
| 100K | ~40 MB                         | ~9 MB              |
| 500K | ~200 MB                        | ~55 MB             |
| 1M   | ~400 MB+                       | ~109 MB            |
JS arrays have significant per-object overhead — each row creates an Array object and each cell is a separate String allocation. At 500K rows this easily exceeds mobile browser limits.

Index + on-demand parsing

Instead of parsing everything upfront, csv-parser.ts takes a two-phase approach:
  1. indexCsv(text) scans the raw CSV text once and builds a lightweight index — an array of byte offsets marking where each row starts. For 500K rows, the offsets array is ~4 MB (one number per row at 8 bytes each) compared to ~200 MB for fully parsed arrays. The scanner handles \r\n line endings and quoted fields containing newlines.
  2. getRow(index, rowIdx) parses a single row on demand by looking up its byte offset, slicing the raw text, and parsing just that line. The returned string[] is temporary — used for rendering, then garbage collected. Only ~30 rows exist in memory at any time.
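A simplified sketch of the two-phase approach described above — the real csv-parser.ts is more thorough (e.g. escaped quotes inside quoted fields), so treat the bodies here as illustrative:

```typescript
interface CsvIndex {
  text: string;         // the raw CSV text, kept as one reference
  lineOffsets: number[]; // offset where each row starts
}

// Phase 1: one pass over the raw text, recording row start offsets.
// A newline inside a quoted field does not terminate a row.
function indexCsv(text: string): CsvIndex {
  const lineOffsets: number[] = [0];
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '"') {
      inQuotes = !inQuotes;
    } else if (ch === "\n" && !inQuotes && i + 1 < text.length) {
      lineOffsets.push(i + 1);
    }
  }
  return { text, lineOffsets };
}

// Phase 2: parse one row on demand. The returned array is temporary —
// used for rendering, then garbage collected.
function getRow(index: CsvIndex, rowIdx: number): string[] {
  const start = index.lineOffsets[rowIdx];
  const end =
    rowIdx + 1 < index.lineOffsets.length
      ? index.lineOffsets[rowIdx + 1]
      : index.text.length;
  const line = index.text.slice(start, end).replace(/\r?\n$/, "");
  // Minimal field split honoring quotes.
  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  for (const ch of line) {
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === "," && !inQuotes) { fields.push(field); field = ""; }
    else field += ch;
  }
  fields.push(field);
  return fields;
}
```

Note that the index holds a single reference to the raw text plus one number per row — the per-row `Array` and per-cell `String` allocations that dominate the naive approach never exist until a row is actually rendered.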

Virtualized rendering

The review table uses @tanstack/react-virtual with a div-based layout. Regardless of file size, only ~30 DOM elements exist at any time (the visible rows plus overscan). The virtualizer manages a container with height: totalRows * ROW_HEIGHT and positions visible rows with transform: translateY(...).
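The windowing arithmetic underneath is worth seeing on its own, independent of @tanstack/react-virtual's API. This is the principle, not the library's code, and `ROW_HEIGHT` and `OVERSCAN` values are assumptions:

```typescript
const ROW_HEIGHT = 36; // px, fixed row height (assumed)
const OVERSCAN = 5;    // extra rows rendered above/below the viewport

// Given the scroll position, compute which row indices need real DOM
// nodes. Every other row exists only as empty scroll height.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  totalRows: number,
): { start: number; end: number } {
  const first = Math.floor(scrollTop / ROW_HEIGHT);
  const last = Math.ceil((scrollTop + viewportHeight) / ROW_HEIGHT);
  return {
    start: Math.max(0, first - OVERSCAN),
    end: Math.min(totalRows - 1, last + OVERSCAN),
  };
}
```

Each rendered row is absolutely positioned inside a container of height `totalRows * ROW_HEIGHT`, so the scrollbar behaves as if all rows were present while only the window above gets DOM nodes.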

Email validation scan

Email validation runs once over all rows in a single pass. Each getRow call creates a temporary array that’s immediately eligible for GC after the check. The only persistent output is a Set<number> of invalid row indices — negligible memory. For 500K rows this takes ~200–500ms. If it becomes a bottleneck, the scan can be moved to a Web Worker or chunked with scheduler.yield() to keep the UI responsive.
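The scan reduces to a single loop. The regex below is a deliberately simple placeholder rather than the real validator, and the `getRow` callback stands in for the on-demand parser described earlier:

```typescript
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/; // placeholder validator

// One pass over all rows; the only persistent output is the set of
// invalid row indices. Each row array is temporary and GC-eligible
// as soon as the check completes.
function scanInvalidEmails(
  totalRows: number,
  getRow: (rowIdx: number) => string[],
  emailCol: number,
): Set<number> {
  const invalid = new Set<number>();
  for (let i = 0; i < totalRows; i++) {
    const row = getRow(i);
    if (!EMAIL_RE.test(row[emailCol] ?? "")) invalid.add(i);
  }
  return invalid;
}
```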

Web Worker feasibility

The indexCsv function is a strong candidate for Web Worker offloading — it’s pure, self-contained, and has no DOM or React dependencies. The worker would receive the raw CSV text, build the index, and post back { headers, lineOffsets, totalRows }. The main thread keeps the text reference for getRow() slicing. Next.js supports new Worker(new URL(...)) with webpack 5. Key files: apps/web/src/lib/csv-parser.ts (indexing and parsing), apps/web/src/components/imports/review-preview.tsx (virtualized table).

Phase 3: Chunked Processing

Once the user completes the review step and starts processing, an Inngest background job takes over. The processing uses a chunked approach for reliability:
  1. download-and-chunk — streams the CSV from Vercel Blob through a parser, buffering rows into chunks of 500. Each chunk is stored in Redis with a 1-hour TTL.
  2. process-batch-N — each chunk is processed independently in its own step.run() call. This means if batch 3 fails, only batch 3 retries — not the entire import.
Within each batch, the system:
  • Maps CSV columns to contact fields using the confirmed mapping
  • Stores unmapped columns as custom_fields (JSONB)
  • Validates and normalizes email addresses
  • Fetches Gravatar avatars (with p-limit concurrency of 10)
  • Deduplicates by email within the batch (last row wins)
  • Upserts contacts in sub-batches of 50, using Postgres xmax = 0 to distinguish inserts from updates
  • Reports progress after each sub-batch
Invalid rows (missing or malformed email) are counted as failures, with up to 10 error samples collected for the notification email.
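Two of the per-batch steps — dedupe and sub-batching — can be sketched as pure functions. The row shape is illustrative; the real job also maps custom fields, fetches avatars, and reports progress:

```typescript
interface ContactRow {
  email: string;
  [field: string]: string;
}

// Deduplicate by normalized email within a batch — last row wins,
// because later Map.set calls overwrite earlier ones.
function dedupeByEmail(rows: ContactRow[]): ContactRow[] {
  const byEmail = new Map<string, ContactRow>();
  for (const row of rows) {
    byEmail.set(row.email.trim().toLowerCase(), row);
  }
  return [...byEmail.values()];
}

// Split a 500-row batch into upsert sub-batches of 50.
function subBatches<T>(rows: T[], size = 50): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    out.push(rows.slice(i, i + size));
  }
  return out;
}
```

On the database side, the insert/update distinction comes from the Postgres system column `xmax`: an upsert such as `INSERT ... ON CONFLICT (...) DO UPDATE ... RETURNING (xmax = 0) AS inserted` reports `xmax = 0` for freshly inserted rows and a nonzero value for rows that took the conflict path, so one round trip yields both the added and updated counts.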

Phase 4: Finalize

After all batches complete, the finalize step:
  • Generates an AI summary of the import (with a graceful fallback if the API call fails)
  • Sends an email notification with import stats (added, updated, failed counts)
  • Sends a separate error report if the failure rate exceeds 5%
  • Invalidates the stats cache so the dashboard reflects the new data

Tracking Progress

The frontend polls the import status endpoint during processing. The UI shows:
  • Upload progress — real percentage via XMLHttpRequest.upload.onprogress
  • Processing progress — added and updated counts with an ETA calculation
  • Rotating joke messages — to keep things fun while you wait
  • Confetti — on completion (extra confetti for your first import)
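The source doesn't specify the exact ETA formula, but a steady-rate extrapolation is the natural sketch — `estimateEtaMs` here is an assumption, not the shipped code:

```typescript
// Extrapolate remaining time from the observed processing rate.
function estimateEtaMs(
  processed: number,
  total: number,
  elapsedMs: number,
): number | null {
  if (processed <= 0) return null; // no rate information yet
  if (processed >= total) return 0; // done
  const msPerRow = elapsedMs / processed;
  return Math.round((total - processed) * msPerRow);
}
```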

Error Handling

The import is designed to be resilient:
| Scenario | Behavior |
| -------- | -------- |
| AI mapping fails | Falls back to null mappings (user maps manually) |
| Individual row has bad email | Counted as failed, added to error samples, processing continues |
| A batch step fails | Only that batch retries (up to 2 retries) |
| Entire job fails | Import marked as failed, user notified |
| AI summary generation fails | Fallback summary string used, import still completes |
| Blob storage unreachable | Job throws, Inngest retries the step |