Why CSV Import?

Most contact management starts with a spreadsheet. Rather than asking users to manually re-enter data, folksbase lets you drag-and-drop a CSV file and handles the rest — including figuring out which columns map to which fields. The import pipeline is designed around two constraints: files can be large (up to 200 MB), and the mapping step should feel effortless. That’s why the system streams the file to storage immediately, only reads the first 64 KB for preview, and uses AI to suggest column mappings before processing begins.

The Import Pipeline

The import flow has four distinct phases. Each phase is isolated so failures in one don’t corrupt the others.

Phase 1: Upload & Validation

When a user uploads a CSV, the API performs several checks before accepting the file:
  • MIME type validation — only text/csv, application/vnd.ms-excel, text/plain, and application/octet-stream are allowed
  • Binary content detection — the first 8 KB is scanned for null bytes to reject non-text files
  • Filename sanitization — path traversal characters and special characters are stripped
  • Size enforcement — a streaming byte counter aborts the upload if it exceeds 200 MB, even if the Content-Length header was spoofed
  • Rate limiting — uploads are limited to 5 per 10 minutes per user (on top of the global 100/60s limiter)
The file is streamed directly to Vercel Blob using ReadableStream.tee() — one branch goes to storage, the other feeds the preview parser. The full file never sits in memory.
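Two of these checks are simple enough to sketch in isolation. This is illustrative code, not the actual route handler — names like `looksBinary` and `makeByteCounter` are assumptions:

```typescript
const MAX_BYTES = 200 * 1024 * 1024; // 200 MB hard limit
const SNIFF_BYTES = 8 * 1024;        // first 8 KB scanned for null bytes

// Reject files whose first 8 KB contains a null byte — a cheap way to
// catch binary files mislabeled as text/csv.
function looksBinary(firstChunk: Uint8Array): boolean {
  const limit = Math.min(firstChunk.length, SNIFF_BYTES);
  for (let i = 0; i < limit; i++) {
    if (firstChunk[i] === 0) return true;
  }
  return false;
}

// Streaming byte counter: trust the bytes actually seen, not the
// Content-Length header, which a client can spoof.
function makeByteCounter() {
  let total = 0;
  return (chunk: Uint8Array): number => {
    total += chunk.length;
    if (total > MAX_BYTES) {
      throw new Error(`Upload exceeds ${MAX_BYTES} bytes`);
    }
    return total;
  };
}
```

Because the counter runs per chunk inside the stream, the upload can be aborted mid-flight rather than after the whole file arrives.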

Phase 2: Preview & AI Column Mapping

Only the first 64 KB of the file is read into memory. From that chunk, the system extracts:
  • Headers — the CSV column names
  • Sample rows — a few rows of data for the user to review
  • Total row count — computed by a streaming parser on a separate tee’d branch
The headers and sample rows are sent to Claude Haiku for column mapping suggestions. The AI returns a JSON object mapping each CSV header to a contact field (email, first_name, last_name, phone, company, notes) or null if it can’t determine a match. Each suggestion includes a confidence score (high or low) based on Levenshtein distance between the header name and the field name. See the AI Integration guide for details on how this works and what happens when the AI is unavailable. The user sees the preview table with suggested mappings and can adjust them before confirming.
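The distance-based confidence check can be sketched roughly as follows. The type names and the `confidenceFor` heuristic here are illustrative assumptions, not the actual implementation:

```typescript
type ContactField = "email" | "first_name" | "last_name" | "phone" | "company" | "notes";

// Hypothetical shape of one suggestion in the AI's response.
interface MappingSuggestion {
  field: ContactField | null;
  confidence: "high" | "low";
}

// Standard dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) => [
    i,
    ...Array(b.length).fill(0),
  ]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// A header like "E-mail Address" normalizes to something containing
// "email", so the suggestion is marked high confidence; an opaque
// header like "col_7" is not.
function confidenceFor(header: string, field: ContactField): "high" | "low" {
  const normalized = header.toLowerCase().replace(/[^a-z]/g, "");
  const target = field.replace(/_/g, "");
  if (normalized.includes(target) || target.includes(normalized)) return "high";
  return levenshtein(normalized, target) <= 2 ? "high" : "low";
}
```

The key design point is that the AI proposes the mapping, but the confidence label comes from a deterministic string-distance check — so a hallucinated mapping to an unrelated header surfaces as low confidence for the user to review.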

Phase 2.5: Review Step — Large File Handling

After the user confirms column mappings, a review step lets them inspect the mapped data with validation feedback (e.g. invalid emails highlighted) before kicking off processing. The challenge: this needs to work for files with 500K+ rows without crashing the browser tab.

The problem with naive rendering

Parsing an entire CSV into a string[][] and holding it in React state doesn’t scale:
| Rows | Naive approach (parsed arrays) | Optimized approach |
| ---- | ------------------------------ | ------------------ |
| 10K  | ~4 MB                          | ~1 MB              |
| 100K | ~40 MB                         | ~9 MB              |
| 500K | ~200 MB                        | ~55 MB             |
| 1M   | ~400 MB+                       | ~109 MB            |
JS arrays have significant per-object overhead — each row creates an Array object and each cell is a separate String allocation. At 500K rows this easily exceeds mobile browser limits.

Index + on-demand parsing

Instead of parsing everything upfront, csv-parser.ts takes a two-phase approach:
  1. indexCsv(text) scans the raw CSV text once and builds a lightweight index — an array of byte offsets marking where each row starts. For 500K rows, the offsets array is ~4 MB (one number per row at 8 bytes each) compared to ~200 MB for fully parsed arrays. The scanner handles \r\n line endings and quoted fields containing newlines.
  2. getRow(index, rowIdx) parses a single row on demand by looking up its byte offset, slicing the raw text, and parsing just that line. The returned string[] is temporary — used for rendering, then garbage collected. Only ~30 rows exist in memory at any time.
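A simplified sketch of the two-phase approach described above — the real csv-parser.ts is more thorough (e.g. escaped quotes inside quoted fields), so treat the bodies here as illustrative:

```typescript
interface CsvIndex {
  text: string;         // the raw CSV text, kept as one reference
  lineOffsets: number[]; // offset where each row starts
}

// Phase 1: one pass over the raw text, recording row start offsets.
// A newline inside a quoted field does not terminate a row.
function indexCsv(text: string): CsvIndex {
  const lineOffsets: number[] = [0];
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '"') {
      inQuotes = !inQuotes;
    } else if (ch === "\n" && !inQuotes && i + 1 < text.length) {
      lineOffsets.push(i + 1);
    }
  }
  return { text, lineOffsets };
}

// Phase 2: parse one row on demand. The returned array is temporary —
// used for rendering, then garbage collected.
function getRow(index: CsvIndex, rowIdx: number): string[] {
  const start = index.lineOffsets[rowIdx];
  const end =
    rowIdx + 1 < index.lineOffsets.length
      ? index.lineOffsets[rowIdx + 1]
      : index.text.length;
  const line = index.text.slice(start, end).replace(/\r?\n$/, "");
  // Minimal field split honoring quotes.
  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  for (const ch of line) {
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === "," && !inQuotes) { fields.push(field); field = ""; }
    else field += ch;
  }
  fields.push(field);
  return fields;
}
```

Note that the index holds a single reference to the raw text plus one number per row — the per-row `Array` and per-cell `String` allocations that dominate the naive approach never exist until a row is actually rendered.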

Virtualized rendering

The review table uses @tanstack/react-virtual with a div-based layout. Regardless of file size, only ~30 DOM elements exist at any time (the visible rows plus overscan). The virtualizer manages a container with height: totalRows * ROW_HEIGHT and positions visible rows with transform: translateY(...).
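The windowing arithmetic underneath is worth seeing on its own, independent of @tanstack/react-virtual's API. This is the principle, not the library's code, and `ROW_HEIGHT` and `OVERSCAN` values are assumptions:

```typescript
const ROW_HEIGHT = 36; // px, fixed row height (assumed)
const OVERSCAN = 5;    // extra rows rendered above/below the viewport

// Given the scroll position, compute which row indices need real DOM
// nodes. Every other row exists only as empty scroll height.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  totalRows: number,
): { start: number; end: number } {
  const first = Math.floor(scrollTop / ROW_HEIGHT);
  const last = Math.ceil((scrollTop + viewportHeight) / ROW_HEIGHT);
  return {
    start: Math.max(0, first - OVERSCAN),
    end: Math.min(totalRows - 1, last + OVERSCAN),
  };
}
```

Each rendered row is absolutely positioned inside a container of height `totalRows * ROW_HEIGHT`, so the scrollbar behaves as if all rows were present while only the window above gets DOM nodes.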

Email validation scan

Email validation runs once over all rows in a single pass. Each getRow call creates a temporary array that’s immediately eligible for GC after the check. The only persistent output is a Set<number> of invalid row indices — negligible memory. For 500K rows this takes ~200–500ms. If it becomes a bottleneck, the scan can be moved to a Web Worker or chunked with scheduler.yield() to keep the UI responsive.
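The scan reduces to a single loop. The regex below is a deliberately simple placeholder rather than the real validator, and the `getRow` callback stands in for the on-demand parser described earlier:

```typescript
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/; // placeholder validator

// One pass over all rows; the only persistent output is the set of
// invalid row indices. Each row array is temporary and GC-eligible
// as soon as the check completes.
function scanInvalidEmails(
  totalRows: number,
  getRow: (rowIdx: number) => string[],
  emailCol: number,
): Set<number> {
  const invalid = new Set<number>();
  for (let i = 0; i < totalRows; i++) {
    const row = getRow(i);
    if (!EMAIL_RE.test(row[emailCol] ?? "")) invalid.add(i);
  }
  return invalid;
}
```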

Web Worker feasibility

The indexCsv function is a strong candidate for Web Worker offloading — it’s pure, self-contained, and has no DOM or React dependencies. The worker would receive the raw CSV text, build the index, and post back { headers, lineOffsets, totalRows }. The main thread keeps the text reference for getRow() slicing. Next.js supports new Worker(new URL(...)) with webpack 5. Key files: apps/web/src/lib/csv-parser.ts (indexing and parsing), apps/web/src/components/imports/review-preview.tsx (virtualized table).

Phase 3: Chunked Processing

Once the user completes the review step and starts processing, an Inngest background job takes over. The processing uses a chunked approach for reliability:
  1. download-and-chunk — streams the CSV from Vercel Blob through a parser, buffering rows into chunks of 500. Each chunk is stored in Redis with a 1-hour TTL.
  2. process-batch-N — each chunk is processed independently in its own step.run() call. This means if batch 3 fails, only batch 3 retries — not the entire import.
Within each batch, the system:
  • Maps CSV columns to contact fields using the confirmed mapping
  • Stores unmapped columns as custom_fields (JSONB)
  • Validates and normalizes email addresses
  • Fetches Gravatar avatars (with p-limit concurrency of 10)
  • Deduplicates by email within the batch (last row wins)
  • Upserts contacts in sub-batches of 50, using Postgres xmax = 0 to distinguish inserts from updates
  • Reports progress after each sub-batch
Invalid rows (missing or malformed email) are counted as failures, with up to 10 error samples collected for the notification email.
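Two of the per-batch steps — dedupe and sub-batching — can be sketched as pure functions. The row shape is illustrative; the real job also maps custom fields, fetches avatars, and reports progress:

```typescript
interface ContactRow {
  email: string;
  [field: string]: string;
}

// Deduplicate by normalized email within a batch — last row wins,
// because later Map.set calls overwrite earlier ones.
function dedupeByEmail(rows: ContactRow[]): ContactRow[] {
  const byEmail = new Map<string, ContactRow>();
  for (const row of rows) {
    byEmail.set(row.email.trim().toLowerCase(), row);
  }
  return [...byEmail.values()];
}

// Split a 500-row batch into upsert sub-batches of 50.
function subBatches<T>(rows: T[], size = 50): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    out.push(rows.slice(i, i + size));
  }
  return out;
}
```

On the database side, the insert/update distinction comes from the Postgres system column `xmax`: an upsert such as `INSERT ... ON CONFLICT (...) DO UPDATE ... RETURNING (xmax = 0) AS inserted` reports `xmax = 0` for freshly inserted rows and a nonzero value for rows that took the conflict path, so one round trip yields both the added and updated counts.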

Phase 4: Finalize

After all batches complete, the finalize step:
  • Generates an AI summary of the import (with a graceful fallback if the API call fails)
  • Sends an email notification with import stats (added, updated, failed counts)
  • Sends a separate error report if the failure rate exceeds 5%
  • Invalidates the stats cache so the dashboard reflects the new data

Tracking Progress

The frontend polls the import status endpoint during processing. The UI shows:
  • Upload progress — real percentage via XMLHttpRequest.upload.onprogress
  • Processing progress — added and updated counts with an ETA calculation
  • Rotating joke messages — to keep things fun while you wait
  • Confetti — on completion (extra confetti for your first import)
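The source doesn't specify the exact ETA formula, but a steady-rate extrapolation is the natural sketch — `estimateEtaMs` here is an assumption, not the shipped code:

```typescript
// Extrapolate remaining time from the observed processing rate.
function estimateEtaMs(
  processed: number,
  total: number,
  elapsedMs: number,
): number | null {
  if (processed <= 0) return null; // no rate information yet
  if (processed >= total) return 0; // done
  const msPerRow = elapsedMs / processed;
  return Math.round((total - processed) * msPerRow);
}
```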

Error Handling

The import is designed to be resilient:
| Scenario | Behavior |
| -------- | -------- |
| AI mapping fails | Falls back to null mappings (user maps manually) |
| Individual row has bad email | Counted as failed, added to error samples, processing continues |
| A batch step fails | Only that batch retries (up to 2 retries) |
| Entire job fails | Import marked as failed, user notified |
| AI summary generation fails | Fallback summary string used, import still completes |
| Blob storage unreachable | Job throws, Inngest retries the step |