## Why CSV Import?
Most contact management starts with a spreadsheet. Rather than asking users to manually re-enter data, folksbase lets you drag-and-drop a CSV file and handles the rest — including figuring out which columns map to which fields. The import pipeline is designed around two constraints: files can be large (up to 200 MB), and the mapping step should feel effortless. That’s why the system streams the file to storage immediately, reads only the first 64 KB for preview, and uses AI to suggest column mappings before processing begins.

## The Import Pipeline
The import flow has four distinct phases. Each phase is isolated so failures in one don’t corrupt the others.

### Phase 1: Upload & Validation
When a user uploads a CSV, the API performs several checks before accepting the file:

- MIME type validation — only `text/csv`, `application/vnd.ms-excel`, `text/plain`, and `application/octet-stream` are allowed
- Binary content detection — the first 8 KB is scanned for null bytes to reject non-text files
- Filename sanitization — path traversal characters and special characters are stripped
- Size enforcement — a streaming byte counter aborts the upload if it exceeds 200 MB, even if the `Content-Length` header was spoofed
- Rate limiting — uploads are limited to 5 per 10 minutes per user (on top of the global 100/60s limiter)
Once validated, the upload is split with `ReadableStream.tee()` — one branch goes to storage, the other feeds the preview parser. The full file never sits in memory.
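A minimal sketch of those streaming guards, assuming the Web Streams API (available in Node 18+ and edge runtimes); the names `sizeGuard`, `looksBinary`, and `handleUpload` are illustrative, not folksbase’s actual API:

```typescript
const MAX_BYTES = 200 * 1024 * 1024; // 200 MB hard cap

// Aborts the stream if the running byte count exceeds the cap,
// regardless of what the Content-Length header claimed.
function sizeGuard(maxBytes: number): TransformStream<Uint8Array, Uint8Array> {
  let seen = 0;
  return new TransformStream({
    transform(chunk, controller) {
      seen += chunk.byteLength;
      if (seen > maxBytes) {
        controller.error(new Error(`upload exceeds ${maxBytes} bytes`));
      } else {
        controller.enqueue(chunk);
      }
    },
  });
}

// Rejects files whose first 8 KB contains a null byte (binary content).
function looksBinary(firstChunk: Uint8Array): boolean {
  const head = firstChunk.subarray(0, 8 * 1024);
  return head.includes(0);
}

async function handleUpload(body: ReadableStream<Uint8Array>) {
  // One tee branch goes to blob storage, the other feeds the
  // 64 KB preview parser; neither buffers the whole file.
  const [toStorage, toPreview] = body.pipeThrough(sizeGuard(MAX_BYTES)).tee();
  return { toStorage, toPreview };
}
```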
### Phase 2: Preview & AI Column Mapping
Only the first 64 KB of the file is read into memory. From that chunk, the system extracts:

- Headers — the CSV column names
- Sample rows — a few rows of data for the user to review
- Total row count — computed by a streaming parser on a separate tee’d branch
The headers are then sent to the AI, which suggests a target field for each column (`email`, `first_name`, `last_name`, `phone`, `company`, `notes`) or `null` if it can’t determine a match.
Each suggestion includes a confidence score (high or low) based on Levenshtein distance between the header name and the field name. See the AI Integration guide for details on how this works and what happens when the AI is unavailable.
The user sees the preview table with suggested mappings and can adjust them before confirming.
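For illustration, here is one way a Levenshtein-based confidence score could be computed. The normalization step and the distance threshold of 2 are assumptions for this sketch, not necessarily what folksbase uses:

```typescript
// Classic edit distance with a single rolling row (O(min) memory).
function levenshtein(a: string, b: string): number {
  const prev = Array.from({ length: b.length + 1 }, (_, i) => i);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0];
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j];
      prev[j] = Math.min(
        prev[j] + 1, // deletion
        prev[j - 1] + 1, // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}

type Confidence = "high" | "low";

// A suggestion is "high" confidence when the normalized header is
// close to the field name (hypothetical threshold: distance <= 2).
function scoreSuggestion(header: string, field: string): Confidence {
  const norm = (s: string) => s.toLowerCase().replace(/[\s_-]+/g, "");
  return levenshtein(norm(header), norm(field)) <= 2 ? "high" : "low";
}
```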
### Phase 2.5: Review Step — Large File Handling
After the user confirms column mappings, a review step lets them inspect the mapped data with validation feedback (e.g. invalid emails highlighted) before kicking off processing. The challenge: this needs to work for files with 500K+ rows without crashing the browser tab.

#### The problem with naive rendering
Parsing an entire CSV into a `string[][]` and holding it in React state doesn’t scale:
| Rows | Naive approach (parsed arrays) | Optimized approach |
|---|---|---|
| 10K | ~4 MB | ~1 MB |
| 100K | ~40 MB | ~9 MB |
| 500K | ~200 MB | ~55 MB |
| 1M | ~400 MB+ | ~109 MB |
The naive numbers are dominated by object overhead: each row is a separate `Array` object and each cell is a separate `String` allocation. At 500K rows this easily exceeds mobile browser limits.
#### Index + on-demand parsing
Instead of parsing everything upfront, `csv-parser.ts` takes a two-phase approach:

- `indexCsv(text)` scans the raw CSV text once and builds a lightweight index — an array of byte offsets marking where each row starts. For 500K rows, the offsets array is ~4 MB (one `number` per row at 8 bytes each) compared to ~200 MB for fully parsed arrays. The scanner handles `\r\n` line endings and quoted fields containing newlines.
- `getRow(index, rowIdx)` parses a single row on demand by looking up its byte offset, slicing the raw text, and parsing just that line. The returned `string[]` is temporary — used for rendering, then garbage collected. Only ~30 rows exist in memory at any time.
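A simplified sketch of the two functions. The real `csv-parser.ts` is more robust; in particular, `getRow` here uses a naive comma split that ignores quoting, and the `CsvIndex` shape is illustrative:

```typescript
interface CsvIndex {
  text: string;
  lineOffsets: number[]; // offset where each row starts
}

// Single scan: record where each row begins. Newlines inside quoted
// fields are skipped so multi-line cells stay in one row.
function indexCsv(text: string): CsvIndex {
  const lineOffsets = [0];
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === "\n" && !inQuotes && i + 1 < text.length) {
      lineOffsets.push(i + 1); // next row starts after the newline
    }
  }
  return { text, lineOffsets };
}

// Parse one row on demand; the returned array is short-lived and
// becomes garbage as soon as the caller drops it.
function getRow(index: CsvIndex, rowIdx: number): string[] {
  const start = index.lineOffsets[rowIdx];
  const end =
    rowIdx + 1 < index.lineOffsets.length
      ? index.lineOffsets[rowIdx + 1]
      : index.text.length;
  const line = index.text.slice(start, end).replace(/\r?\n$/, "");
  // Naive field split for the sketch; real parsing must honor quotes.
  return line.split(",");
}
```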
#### Virtualized rendering
The review table uses `@tanstack/react-virtual` with a div-based layout. Regardless of file size, only ~30 DOM elements exist at any time (the visible rows plus overscan). The virtualizer manages a container with `height: totalRows * ROW_HEIGHT` and positions visible rows with `transform: translateY(...)`.
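The windowing math the virtualizer performs can be sketched as a standalone function. `ROW_HEIGHT`, `visibleRange`, and the overscan of 5 are illustrative values for this sketch, not the actual `@tanstack/react-virtual` API:

```typescript
const ROW_HEIGHT = 36; // px, illustrative fixed row height

// Which row indices should be mounted for the current scroll position:
// the rows intersecting the viewport, padded by `overscan` on each side.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  totalRows: number,
  overscan = 5,
): { start: number; end: number } {
  const first = Math.floor(scrollTop / ROW_HEIGHT);
  const last = Math.ceil((scrollTop + viewportHeight) / ROW_HEIGHT);
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(totalRows, last + overscan),
  };
}

// Each mounted row is absolutely positioned inside a container of
// height totalRows * ROW_HEIGHT, so the scrollbar reflects the full file.
const rowStyle = (rowIdx: number) => ({
  transform: `translateY(${rowIdx * ROW_HEIGHT}px)`,
  height: `${ROW_HEIGHT}px`,
});
```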
#### Email validation scan
Email validation runs once over all rows in a single pass. Each `getRow` call creates a temporary array that’s immediately eligible for GC after the check. The only persistent output is a `Set<number>` of invalid row indices — negligible memory. For 500K rows this takes ~200–500 ms. If it becomes a bottleneck, the scan can be moved to a Web Worker or chunked with `scheduler.yield()` to keep the UI responsive.
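A sketch of that single-pass scan. The `scanInvalidEmails` name and the email regex are illustrative, and `rows` stands in for iterating `getRow` over every data row:

```typescript
// Deliberately simple pattern for the sketch; real validation rules differ.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Only the Set of invalid indices survives the scan; each row array
// is dropped (and GC-eligible) as soon as it has been checked.
function scanInvalidEmails(
  rows: Iterable<string[]>,
  emailCol: number,
): Set<number> {
  const invalid = new Set<number>();
  let i = 0;
  for (const row of rows) {
    if (!EMAIL_RE.test(row[emailCol] ?? "")) invalid.add(i);
    i++;
  }
  return invalid;
}
```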
#### Web Worker feasibility
The `indexCsv` function is a strong candidate for Web Worker offloading — it’s pure, self-contained, and has no DOM or React dependencies. The worker would receive the raw CSV text, build the index, and post back `{ headers, lineOffsets, totalRows }`. The main thread keeps the text reference for `getRow()` slicing. Next.js supports `new Worker(new URL(...))` with webpack 5.
Key files: `apps/web/src/lib/csv-parser.ts` (indexing and parsing), `apps/web/src/components/imports/review-preview.tsx` (virtualized table).
### Phase 3: Chunked Processing
Once the user confirms the column mapping, an Inngest background job takes over. The processing uses a chunked approach for reliability:

- `download-and-chunk` — streams the CSV from Vercel Blob through a parser, buffering rows into chunks of 500. Each chunk is stored in Redis with a 1-hour TTL.
- `process-batch-N` — each chunk is processed independently in its own `step.run()` call. This means if batch 3 fails, only batch 3 retries — not the entire import.

Within each batch, the job:

- Maps CSV columns to contact fields using the confirmed mapping
- Stores unmapped columns as `custom_fields` (JSONB)
- Validates and normalizes email addresses
- Fetches Gravatar avatars (with `p-limit` concurrency of 10)
- Deduplicates by email within the batch (last row wins)
- Upserts contacts in sub-batches of 50, using Postgres `xmax = 0` to distinguish inserts from updates
- Reports progress after each sub-batch
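Two of those batch steps, the last-row-wins dedupe and the sub-batch split, can be sketched as pure functions (the `Contact` shape and function names are illustrative, not the job’s actual code):

```typescript
interface Contact {
  email: string;
  [key: string]: unknown;
}

// Deduplicate by email within a batch; a Map keyed by normalized
// email means later rows silently overwrite earlier ones (last wins).
function dedupeByEmail(rows: Contact[]): Contact[] {
  const byEmail = new Map<string, Contact>();
  for (const row of rows) {
    byEmail.set(row.email.toLowerCase(), row);
  }
  return [...byEmail.values()];
}

// Split a deduped batch into sub-batches of 50 for upserting.
function subBatches<T>(items: T[], size = 50): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```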
### Phase 4: Finalize
After all batches complete, the finalize step:

- Generates an AI summary of the import (with a graceful fallback if the API call fails)
- Sends an email notification with import stats (added, updated, failed counts)
- Sends a separate error report if the failure rate exceeds 5%
- Invalidates the stats cache so the dashboard reflects the new data
## Tracking Progress
The frontend polls the import status endpoint during processing. The UI shows:

- Upload progress — real percentage via `XMLHttpRequest.upload.onprogress`
- Processing progress — added and updated counts with an ETA calculation
- Rotating joke messages — to keep things fun while you wait
- Confetti — on completion (extra confetti for your first import)
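One plausible shape for the ETA calculation mentioned above (illustrative; the real UI may smooth the rate over recent batches rather than using the overall average):

```typescript
// Estimate remaining seconds from the average processing rate so far.
// Returns null before any rows have been processed (no rate yet).
function etaSeconds(
  processedRows: number,
  totalRows: number,
  elapsedMs: number,
): number | null {
  if (processedRows === 0) return null;
  const rowsPerMs = processedRows / elapsedMs;
  return Math.round((totalRows - processedRows) / rowsPerMs / 1000);
}
```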
## Error Handling
The import is designed to be resilient:

| Scenario | Behavior |
|---|---|
| AI mapping fails | Falls back to `null` mappings (user maps manually) |
| Individual row has bad email | Counted as failed, added to error samples, processing continues |
| A batch step fails | Only that batch retries (up to 2 retries) |
| Entire job fails | Import marked as failed, user notified |
| AI summary generation fails | Fallback summary string used, import still completes |
| Blob storage unreachable | Job throws, Inngest retries the step |