
Case study

From 30% to 99.7%: Rebuilding an Upload System That Was Losing Customer Data

A custom browser-native upload system with crash recovery, multi-tab coordination, and resumable transfers. No Electron, no native app. Just the web.

10 min read
  • architecture
  • next.js
  • typescript
  • web-workers
  • file-system-access-api
  • s3
  • sqlite
  • web-locks
  • broadcast-channel

The Hardest Part Was the Browser Itself

A geospatial platform was losing customers over upload failures. Their system handled LiDAR point clouds, panoramic imagery, and trajectory data — large binary files that downstream processing pipelines consume. Upload completion for files over 1GB sat around 30%. Some enterprise teams had given up entirely and were shipping data on external hard drives.

I partnered with their engineering team for 8 weeks to rebuild the upload system from scratch. The constraint: no Electron shell, no native helper app, no desktop agent. Just the browser. That meant building what amounts to a server inside a browser tab — its own database, worker pool, crash recovery, and multi-tab coordination. By the end, upload reliability was at 99.7%.

The pain points were specific:

  • No resume capability. A 4GB file that fails at 95% starts over from byte zero. Users learned to dread large files.
  • Memory blowouts. Reading a multi-gigabyte file into an ArrayBuffer before uploading consumed all available RAM, crashing the tab and sometimes the entire browser.
  • Silent failures. An upload would appear to complete, but the server had only received a partial file. Nobody noticed until the data processing pipeline choked.
  • No progress visibility. Users would kick off a batch, switch to another tab for 20 minutes, come back, and have no idea which files succeeded, which failed, and why.
  • Data loss from any disruption. Network drops, permission revocations, auth token expiry, S3 throttling, or simply closing the tab. Any interruption meant losing all in-flight progress and starting over.

This was not a backlog item. It was a retention risk. The upload experience had to go from "mostly works for small files" to "reliable regardless of file size, network quality, or user behavior."


Why No Existing Solution Worked

The default approach for browser uploads — <input type="file">, read the file, POST it — breaks at scale. Files over 1GB exhaust browser memory. A single HTTP request offers no resume. Progress state lives in JavaScript variables and vanishes when a tab crashes. Multiple tabs compete for the same resources with no coordination.

My first instinct was not to build this from scratch. I evaluated Uppy, FilePond, and several S3-specific upload SDKs. Uppy and FilePond are excellent tools, and at smaller scale either would have been sufficient. But at this scale, specific things break:

  1. Uppy's crash recovery breaks with S3 multipart. Golden Retriever saves file metadata, but a browser crash leaves the in-progress S3 multipart upload ID stale. On restart, Uppy tries to resume with that stale ID and fails. This is a documented limitation.
  2. Uppy's per-part retry restarts the entire upload. When one part fails, all previously uploaded parts are discarded and the whole multipart upload starts over. At 10,000 parts per file, this is catastrophic.
  3. No library handles persistent file handles that survive page reloads without re-prompting the user to re-select files. Uppy's crash recovery uses IndexedDB and Service Workers, not persistent FileSystemHandles.
  4. No library provides multi-tab coordination: leader election, cross-tab sync, and automatic failover when a tab closes. No such plugin exists in Uppy's ecosystem.
  5. No library handles 100k+ file queues at the scale this platform required. Even adding ~30k files to Uppy takes minutes just to render.

No existing solution addressed all of these simultaneously. A custom system was the only path forward.

A Deliberate Platform Constraint: Chromium Only

One of the first decisions I made with the team: target Chrome and Chromium-based browsers (Edge, Arc, Brave, Opera) exclusively. Not Firefox, not Safari. The APIs required for true crash recovery and durable persistence — the File System Access API and OPFS SyncAccessHandle — only exist in Chromium. Firefox has had the File System Access API behind a flag for years with no signal of shipping; Safari added partial support but lacks the permission persistence model.

The alternative was building a dedicated desktop application or shipping a cross-browser fallback that could not deliver the reliability guarantees the project demanded. Chromium-only kept the system purely web-based while unlocking the full capability set. In unsupported browsers, the upload panel simply does not mount — the user sees a clear message explaining the requirement, not a broken UI.
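Gating on capability rather than user-agent string keeps this check honest. A minimal sketch of such a probe (names are hypothetical, not the project's actual code); it takes the global object as a parameter so the logic can be exercised outside a browser:

```typescript
// Capability probe for the APIs the engine depends on (illustrative sketch).
// Accepts the global object so the check is testable outside a browser.
interface GlobalLike {
  navigator?: { storage?: { getDirectory?: unknown }; locks?: unknown };
  showDirectoryPicker?: unknown;
  BroadcastChannel?: unknown;
}

function isUploadEngineSupported(g: GlobalLike): boolean {
  return (
    typeof g.showDirectoryPicker === "function" && // File System Access API
    typeof g.navigator?.storage?.getDirectory === "function" && // OPFS root
    g.navigator?.locks !== undefined && // Web Locks API
    typeof g.BroadcastChannel === "function" // cross-tab messaging
  );
}
```

In the app shell, the upload panel would mount only when the probe passes against `globalThis`, and render the explanatory message otherwise.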


What Was Delivered

The final system is a custom browser-native upload engine with capabilities that no off-the-shelf library provides. Here is what it does.

Off-Main-Thread Architecture

All upload logic runs in dedicated Web Workers. The main thread (React UI) never touches a file, never makes an S3 request, never writes to the database. The UI stays completely responsive — no jank, no freezes — regardless of how many files are uploading or how large they are. If the React app crashes, the workers can be respawned and resume from the last checkpoint.
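One way to picture the split: the worker posts typed status events, and the main thread only folds them into view state. A sketch with hypothetical message names (not the project's actual protocol):

```typescript
// Illustrative message protocol between the upload worker and the UI thread.
// The worker owns files, S3, and the database; the UI only reduces events.
type WorkerEvent =
  | { kind: "part-done"; fileId: string; part: number; totalParts: number }
  | { kind: "file-done"; fileId: string }
  | { kind: "file-failed"; fileId: string; reason: string };

type UiState = Map<string, { status: string; parts: number; totalParts: number }>;

function applyEvent(state: UiState, ev: WorkerEvent): UiState {
  const next = new Map(state); // copy-on-write keeps React rendering simple
  switch (ev.kind) {
    case "part-done":
      next.set(ev.fileId, { status: "uploading", parts: ev.part, totalParts: ev.totalParts });
      break;
    case "file-done": {
      const prev = next.get(ev.fileId);
      const total = prev?.totalParts ?? 0;
      next.set(ev.fileId, { status: "done", parts: total, totalParts: total });
      break;
    }
    case "file-failed":
      next.set(ev.fileId, { status: `failed: ${ev.reason}`, parts: 0, totalParts: 0 });
      break;
  }
  return next;
}
```

Because state is rebuilt purely from events, a respawned worker can replay its checkpoint and the UI converges to the correct picture.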

Crash-Resilient Persistence

The system maintains a full checkpoint of every in-flight upload using SQLite compiled to WebAssembly running inside a worker, backed by the browser's Origin Private File System. This gives real relational query capability — not just key-value storage — that persists across browser sessions and survives tab crashes.

File handles are stored separately in IndexedDB, allowing the system to re-access files after a page reload without prompting the user to re-select anything. The combination means that after a browser crash, a tab close, or even a power failure, the system knows exactly which files were mid-upload, where each one left off, and can re-access the original files on disk — all without user intervention.
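A hedged sketch of what such a checkpoint could look like; the schema and function names below are assumptions for illustration, not the project's actual code. The key idea is that the resume position is derived from confirmed parts only, stopping at the first gap:

```typescript
// Illustrative checkpoint schema for SQLite (WASM, stored in OPFS).
const CHECKPOINT_SCHEMA = `
  CREATE TABLE IF NOT EXISTS upload_parts (
    file_id     TEXT NOT NULL,
    part_number INTEGER NOT NULL,
    byte_start  INTEGER NOT NULL,
    byte_end    INTEGER NOT NULL,
    etag        TEXT,                -- set once S3 confirms the part
    PRIMARY KEY (file_id, part_number)
  );
`;

interface PartRow { partNumber: number; byteEnd: number; etag: string | null }

// Resume from the end of the last confirmed contiguous part; anything after
// a gap is re-uploaded because S3 never acknowledged it.
function resumeOffset(rows: PartRow[]): number {
  const confirmed = rows
    .filter((r) => r.etag !== null)
    .sort((a, b) => a.partNumber - b.partNumber);
  let offset = 0;
  let expected = 1;
  for (const row of confirmed) {
    if (row.partNumber !== expected) break; // hole in the sequence: stop here
    offset = row.byteEnd;
    expected++;
  }
  return offset;
}
```

On restart, everything past `resumeOffset(rows)` is re-uploaded; parts the server never acknowledged are never trusted.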

Multi-Tab Coordination

Multiple browser tabs sharing the same origin means multiple instances of the upload system competing for the same database and the same S3 upload IDs. The system implements a leader/follower model using the Web Locks API: only one tab runs the upload engine at any time. All other tabs display live progress summaries via BroadcastChannel.

If the leader tab closes or crashes, the OS releases the lock automatically — no heartbeat timeout, no stale-leader problem. A follower is promoted to leader within seconds, checks for interrupted uploads, and auto-resumes if the engine was actively uploading. At no point do two tabs run the upload engine concurrently. The entire transition is invisible to the user beyond a brief toast notification. The design draws on Notion's approach to running SQLite in the browser.
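The mechanism can be sketched in a few lines. This version takes the lock manager as a parameter (in a browser it would be `navigator.locks`) so the election logic is plain TypeScript; the lock name and function names are illustrative:

```typescript
// Leader election via the Web Locks API (sketch). Every tab queues on the
// same named lock; the browser grants it to one tab at a time, so the
// callback firing *is* the promotion to leader. Holding the lock for the
// tab's lifetime means the OS releases it on crash or close, with no
// heartbeat and no stale-leader window.
type LockManagerLike = {
  request: (name: string, cb: () => Promise<void>) => Promise<void>;
};

function runAsLeaderWhenElected(
  locks: LockManagerLike,
  becomeLeader: () => void
): Promise<void> {
  return locks.request("upload-engine-leader", async () => {
    becomeLeader(); // start the engine, check for interrupted uploads, etc.
    await new Promise<void>(() => {}); // never resolves: hold until tab death
  });
}
```

In a browser this would be called once per tab, e.g. `runAsLeaderWhenElected(navigator.locks, startEngine)`; followers simply sit in the lock queue while subscribing to progress over BroadcastChannel.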

Intelligent File Discovery

Users can drag-and-drop files or entire folder trees, or use explicit picker dialogs. The system handles both identically — recursively walking directory trees, classifying files by type, rejecting unsupported formats early, and queueing everything into the upload pipeline. It handles 100k+ file queues without degrading UI performance, with immediate visual feedback as files are discovered.

File filtering happens at discovery time, not upload time. The system tracks rejection counts so the UI can show "47 files skipped (unsupported format)" rather than silently ignoring them. Catching invalid files early avoids wasting upload bandwidth on files the server would reject anyway.
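The traversal itself is a plain recursive walk with filtering inline. The sketch below uses stand-in types instead of real FileSystemDirectoryHandle entries, and an example format whitelist; both are assumptions for illustration:

```typescript
// Recursive discovery over a directory tree (illustrative sketch). DirLike
// and FileLike stand in for FileSystemDirectoryHandle entries so the
// traversal and filter logic run anywhere.
type FileLike = { kind: "file"; name: string };
type DirLike = { kind: "directory"; name: string; entries: Entry[] };
type Entry = FileLike | DirLike;

const SUPPORTED = new Set(["las", "laz", "jpg", "png"]); // example formats only

function discover(dir: DirLike, accepted: string[] = [], rejected = { count: 0 }) {
  for (const entry of dir.entries) {
    if (entry.kind === "directory") {
      discover(entry, accepted, rejected);
    } else {
      const ext = entry.name.split(".").pop()?.toLowerCase() ?? "";
      // Filter at discovery time so bandwidth is never spent on rejects.
      if (SUPPORTED.has(ext)) accepted.push(entry.name);
      else rejected.count++;
    }
  }
  return { accepted, rejected: rejected.count };
}
```

The rejection counter is what feeds the "47 files skipped (unsupported format)" message rather than silently dropping entries.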

Resumable Multipart Transfers

Every file is split into parts and uploaded via S3 multipart upload. Each completed part is checkpointed, and on resume the system verifies that reported progress matches actual uploaded data — ensuring a file that crashed at 95% only re-uploads the remaining 5%, not the entire file.

The system uses zero-copy file streaming — slicing byte ranges by reference rather than loading entire files into memory — keeping memory footprint constant at ~50MB regardless of file size.
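The part arithmetic behind this is simple and worth seeing. The sketch below computes byte ranges only; in the browser, `File.slice(start, end)` then yields a Blob that references those bytes without reading them into memory (the part size here is arbitrary):

```typescript
// Part-boundary computation for S3 multipart upload (sketch). Only the
// ranges are computed up front; slicing is by reference, so memory stays
// flat no matter how large the file is.
interface PartRange { partNumber: number; start: number; end: number } // end exclusive

function partRanges(fileSize: number, partSize: number): PartRange[] {
  const parts: PartRange[] = [];
  for (let start = 0, n = 1; start < fileSize; start += partSize, n++) {
    parts.push({ partNumber: n, start, end: Math.min(start + partSize, fileSize) });
  }
  return parts;
}
```

S3 numbers parts from 1 and requires each part except the last to meet a minimum size, which is why the last range alone may be short.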

Comprehensive Failure Handling

A production upload system encounters failures constantly. The system was designed around every failure scenario we could identify: network drops mid-transfer, server errors, S3 rate limiting, expired authentication tokens, browser-revoked file permissions, files moved or deleted from disk after selection, and browser crashes during active uploads.

Every scenario has an appropriate automated response. Transient failures are retried automatically. Failures that require user action — like re-granting file access — surface clear prompts explaining what happened and what to do. Failures that affect an entire directory pause the full engine rather than failing files one by one, so the user fixes the root cause once instead of dismissing errors repeatedly.

Retry state is persisted to the database, so even a browser crash during a backoff wait doesn't lose track of what needs to be retried.
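A typical shape for such a retry schedule is capped exponential backoff with jitter; the constants below are illustrative, not the system's actual tuning. Because the attempt counter is what gets persisted, a crash mid-wait resumes the schedule instead of resetting it:

```typescript
// Capped exponential backoff with "equal jitter" (sketch). The jitter
// source is injected so the schedule is deterministic under test.
function backoffMs(
  attempt: number,
  baseMs = 1000,
  capMs = 60_000,
  jitter: () => number = () => Math.random()
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt); // 1s, 2s, 4s, ... capped
  return Math.floor(exp / 2 + jitter() * (exp / 2)); // half fixed, half random
}
```

Equal jitter keeps a floor under the delay (so retries never stampede instantly) while still spreading clients out under S3 throttling.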

Accurate Progress and ETA

Speed and ETA calculation are deceptively hard. Naive approaches produce wildly fluctuating numbers. The system uses smoothing algorithms tuned for stability over jittery connections — the speed display stays stable while the ETA stays responsive. During gaps between files, the display holds the current estimate rather than flickering to zero.
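One common smoothing approach consistent with this behavior is an exponential moving average over instantaneous throughput samples; the alpha value below is an illustrative choice, not the system's actual tuning:

```typescript
// Speed smoothing via exponential moving average (sketch). A low alpha
// damps jitter in the displayed rate; ETA is derived from the smoothed
// rate, so it moves with real trends instead of per-sample noise.
function makeSpeedEstimator(alpha = 0.2) {
  let ema: number | null = null;
  return {
    // Feed one measurement: bytes transferred over an interval in seconds.
    sample(bytes: number, seconds: number): number {
      const instant = bytes / seconds;
      ema = ema === null ? instant : alpha * instant + (1 - alpha) * ema;
      return ema;
    },
    // Remaining time at the smoothed rate; null until a rate exists.
    etaSeconds(bytesRemaining: number): number | null {
      return ema && ema > 0 ? bytesRemaining / ema : null;
    },
  };
}
```

Holding the last EMA during gaps between files is what keeps the display from flickering to zero when no bytes happen to be in flight.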


Known Limitations

The system operates within three constraints:

  • File System Access API permissions are session-scoped by default in Chrome, though persistent permissions are available since Chrome 122.
  • OPFS storage quota is origin-scoped and browser-managed; periodic cleanup of completed records is necessary for very large upload histories.
  • The cross-origin isolation headers required for OPFS can restrict third-party embeds that don't set appropriate CORS headers.

Results and Business Impact

What Changed

  • Upload completion rate: ~30% → 99.7% for files over 1GB. The remaining 0.3% are genuine infrastructure failures.
  • Memory footprint: Constant ~50MB regardless of file size, down from 5GB+ for a single large upload.
  • Crash recovery: Resume picks up from the exact part, validated against S3. No progress lost.
  • Concurrent throughput: 3 files × 5 parts = 15 concurrent HTTP streams, fully utilizing available bandwidth.
  • Multi-tab coordination: Leader/follower model across tabs. If the leader tab closes, a follower promotes and resumes within seconds.

Users stopped transferring data via external hard drives. The platform became the default upload path again. Users could close their laptop, reopen it the next day, and resume exactly where they left off. Support tickets related to upload failures dropped to near-zero within the first month.

  • No upload server. The entire engine runs in the browser, eliminating upload proxy infrastructure.
  • Retention risk eliminated. Upload reliability was the #1 complaint from enterprise accounts.
  • File size ceiling removed. The architecture supports files up to 1TB. The previous system maxed out around 500MB.

I also ran pair-programming sessions with the internal team and produced architecture decision records so they could own and evolve the system independently after the engagement ended. The goal was never to create a dependency — it was to leave the team in a stronger position than I found them.

What This Engagement Proved#

  1. Browser-native can compete with desktop. With the right APIs and architecture, a browser tab can deliver the reliability and performance that previously required Electron or a native helper app. The key is knowing which platform capabilities to commit to and building deeply on them.

  2. Resilience comes from treating every disruption as normal. Network drops, tab crashes, permission revocations, auth expiry — none of these are edge cases. They are the default operating environment. A system designed for the user who walked away handles every other scenario by default.

  3. The hardest problems are at the intersection of browser APIs. No single API was the hard part. The challenge was composing File System Access, Web Workers, Web Locks, OPFS, BroadcastChannel, and S3 multipart into a coherent system where each component's failure modes are handled by another.

Work with me

Got a frontend bottleneck like this one?

I help B2B SaaS teams ship faster by fixing the infrastructure that's slowing them down. 30 minutes, no slides — we'll figure out if I can help.