Chunking & content-addressing in review
How systems store and sync big files by their content, not their name, and why a one-byte insertion re-sends nearly the whole file under naïve fixed-size chunking, but almost nothing under content-defined chunking.
This site already runs on content-addressing: every photo is served at a
content-stable URL with a ?v=N cache-bust, the service worker
treats /images/* as immutable (cache-first), and the originals
live in R2. That's the idea. The systems-grade version (chunk a
file by content, address each chunk by its hash, ship only the chunks that
changed, and verify a slice without fetching the whole file) is what mkit
does. Everything below re-creates those primitives, self-contained and
dependency-free, right in your browser.
1 · The boundary problem
Type in the box, then hit Snapshot as v1 to freeze the current chunks. Now edit the text; Insert a line at the top shows the effect best. Watch the two strategies diverge: under fixed-size chunks every boundary after the edit shifts and almost every later chunk's hash changes (so you'd re-upload nearly the whole file); under FastCDC the boundaries ride the content, so only the chunks near your edit change.
Fixed-size chunks every 40 bytes: the naïve approach
FastCDC content-defined: min 16 · avg 40 · max 120 bytes
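To make the fixed-size failure concrete outside the interactive demo, here is a minimal sketch in JavaScript. The helper names (fixedChunks, reusedCount) and the sample text are made up for illustration, and the code slices characters rather than bytes, which is equivalent for ASCII input.

```javascript
// Split text into fixed-size pieces. Character-based for simplicity;
// for ASCII text this matches slicing the underlying bytes.
function fixedChunks(text, size) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Count how many chunks of `next` already appear somewhere in `prev`.
function reusedCount(prev, next) {
  const seen = new Set(prev);
  return next.filter((chunk) => seen.has(chunk)).length;
}

// Illustrative sample: twelve numbered lines, so every stretch of text is distinct.
const v1Text = Array.from(
  { length: 12 },
  (_, i) => `line ${i}: some sample text\n`
).join("");

const v1Fixed = fixedChunks(v1Text, 40);
const v2Fixed = fixedChunks("X" + v1Text, 40); // one character inserted at the top

console.log(`${reusedCount(v1Fixed, v2Fixed)} of ${v2Fixed.length} chunks reused`);
```

Because every cut sits at a multiple of 40, the single inserted character moves all later content by one position relative to the cuts, so no chunk of the edited text matches a chunk of the original.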
2 · FastCDC, briefly
A rolling gear hash slides over the bytes; wherever the hash's
low bits hit a target pattern, that's a chunk boundary. Because the cut
points follow the data, inserting or deleting bytes only
disturbs the boundaries near the edit. The rest line up exactly as
before. The min and max limits clamp pathological sizes;
"normalized" chunking uses a stricter mask (more bits must match) before
the average size and a looser one after, so sizes cluster near the target.
~60 lines, no deps:
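The following is a simplified FastCDC-style chunker sketched from the description above. It is not mkit's code or this page's own listing: the gear table seed, the way the two masks are derived from the average, and all function names are assumptions made for illustration. It reuses the demo's min 16, avg 40, max 120 parameters and treats "all masked low bits are zero" as the target pattern.

```javascript
// Gear table: one pseudo-random 32-bit value per possible byte value.
// A fixed-seed xorshift32 keeps the table identical on every run.
const GEAR = new Uint32Array(256);
let gearSeed = 0x9e3779b9; // arbitrary seed, chosen for this sketch
function nextRand() {
  gearSeed ^= gearSeed << 13;
  gearSeed ^= gearSeed >>> 17;
  gearSeed ^= gearSeed << 5;
  return gearSeed >>> 0;
}
for (let i = 0; i < 256; i++) GEAR[i] = nextRand();

// Return the cut offsets (exclusive chunk ends) for `bytes`.
function cdcCuts(bytes, { min = 16, avg = 40, max = 120 } = {}) {
  const bits = Math.round(Math.log2(avg));
  const strict = (1 << (bits + 1)) - 1; // more bits must be zero: cuts are rarer
  const loose = (1 << (bits - 1)) - 1; // fewer bits must be zero: cuts are likelier
  const cuts = [];
  let start = 0;
  while (start < bytes.length) {
    const end = Math.min(start + max, bytes.length);
    let cut = end; // fallback: forced cut at max, or at end of input
    let hash = 0;
    for (let i = start; i < end; i++) {
      // Gear hash: shift out old bytes, mix in the new one (mod 2^32).
      hash = ((hash << 1) + GEAR[bytes[i]]) >>> 0;
      const len = i - start + 1;
      if (len < min) continue; // never cut before `min` bytes
      const mask = len < avg ? strict : loose; // normalized chunking
      if ((hash & mask) === 0) {
        cut = i + 1;
        break;
      }
    }
    cuts.push(cut);
    start = cut;
  }
  return cuts;
}

// Slice `bytes` into chunks at the computed cut points.
function cdcChunks(bytes, opts) {
  const out = [];
  let start = 0;
  for (const cut of cdcCuts(bytes, opts)) {
    out.push(bytes.subarray(start, cut));
    start = cut;
  }
  return out;
}

// Stand-in for a file: 4 KiB of pseudo-random bytes, then the same
// bytes with one extra byte inserted at the front.
const cdcV1 = Uint8Array.from({ length: 4096 }, () => nextRand() & 0xff);
const cdcV2 = Uint8Array.from([0x58, ...cdcV1]);

const keyOf = (chunk) => chunk.join(","); // cheap identity key for the sketch
const v1Keys = new Set(cdcChunks(cdcV1).map(keyOf));
const v2Chunks = cdcChunks(cdcV2);
const cdcReused = v2Chunks.filter((c) => v1Keys.has(keyOf(c))).length;
console.log(`${cdcReused} of ${v2Chunks.length} chunks reused after a 1-byte insert`);
```

Once the edited stream cuts at a position where the original also cut, both chunkers restart from the same state over the same remaining bytes, so every later chunk is identical. This sketch hashes from the start of each chunk for clarity; a tuned implementation would typically skip hashing the first min bytes.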
3 · Big files
Content-addressing earns its keep on large blobs. Drop any file (it
never leaves your machine) and your browser FastCDC-chunks it locally
(avg 8 KB), then hashes each chunk with SHA-256 to get its content ID.
The blocks are drawn proportional to chunk size so you can see the
variable-length cuts; the demo also counts identical chunks (a
ChunkedBlob's dedup win).
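The content-ID and dedup step can be sketched as below. This version uses Node's synchronous crypto.createHash so it runs as a plain script; in the browser the equivalent call is the asynchronous crypto.subtle.digest("SHA-256", bytes). The ingest helper and the shape of its return value are hypothetical, not mkit's ChunkedBlob API.

```javascript
const { createHash } = require("node:crypto");

// Content ID = hex SHA-256 of the chunk's bytes.
function contentId(chunk) {
  return createHash("sha256").update(chunk).digest("hex");
}

// Store each distinct chunk once; the file becomes an ordered list of IDs.
function ingest(chunks, store = new Map()) {
  const manifest = [];
  let duplicates = 0;
  for (const chunk of chunks) {
    const id = contentId(chunk);
    if (store.has(id)) duplicates++; // bytes already held: nothing new to store
    else store.set(id, chunk);
    manifest.push(id);
  }
  return { manifest, store, duplicates };
}

// Five illustrative chunks, three of them identical.
const enc = new TextEncoder();
const blobChunks = ["header", "padding", "padding", "body", "padding"].map((s) =>
  enc.encode(s)
);
const blob = ingest(blobChunks);
console.log(blob.manifest.length, blob.store.size, blob.duplicates); // 5 3 2
```

The manifest keeps all five positions so the file can be rebuilt in order, while the store holds only the three distinct chunks.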
4 · Delta wire format & verified slices
Once both sides address chunks by hash, sync is a set difference: the
sender lists its chunk IDs, the receiver names the few it lacks, and only
those bytes cross the wire. That exchange is the delta wire format.
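The three-step exchange can be sketched with plain maps. This is not mkit's actual wire format, which the text does not specify: the function names, the short "id-..." strings standing in for real hashes, and the use of strings for chunk bytes are all placeholders for illustration.

```javascript
// Receiver: given the sender's ordered chunk IDs, name the ones not held locally.
function missingIds(senderManifest, localStore) {
  return [...new Set(senderManifest)].filter((id) => !localStore.has(id));
}

// Sender: answer the request with only the requested chunks.
function packDelta(wantedIds, senderStore) {
  return wantedIds.map((id) => [id, senderStore.get(id)]);
}

// Receiver: merge the delta, then rebuild the file in manifest order.
// A real receiver would re-hash each received chunk and check it against its ID.
function applyDelta(senderManifest, delta, localStore) {
  for (const [id, bytes] of delta) localStore.set(id, bytes);
  return senderManifest.map((id) => localStore.get(id)).join("");
}

// The sender holds v2 of a file; the receiver still holds v1.
const senderStore = new Map([
  ["id-a", "alpha "],
  ["id-b", "beta "],
  ["id-c2", "gamma! "], // the one chunk that changed
  ["id-d", "delta"],
]);
const senderManifest = ["id-a", "id-b", "id-c2", "id-d"];
const receiverStore = new Map([
  ["id-a", "alpha "],
  ["id-b", "beta "],
  ["id-c", "gamma "],
  ["id-d", "delta"],
]);

const wanted = missingIds(senderManifest, receiverStore);
const delta = packDelta(wanted, senderStore);
const rebuilt = applyDelta(senderManifest, delta, receiverStore);
console.log(wanted, rebuilt); // [ 'id-c2' ] 'alpha beta gamma! delta'
```

Only one of the four chunks travels, and the cost of the sync scales with the number of changed chunks rather than with the size of the file.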
The last piece is trust: Bao verified slices wrap a
BLAKE3 tree so you can stream-verify any byte range against the
root hash without fetching the whole file. crypto.subtle has
no BLAKE3, so the demos above use SHA-256 for the content IDs; the shape of
the argument is identical, and mkit uses BLAKE3 + Bao for the real thing.
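To show the shape of that argument, here is a deliberately simplified stand-in: a binary Merkle tree over chunk hashes, with a proof that one chunk belongs under a known root. It is not Bao's encoding and uses SHA-256 rather than BLAKE3; the function names, the one-byte domain tags, and the rule of carrying an odd node up unchanged are assumptions of this sketch. It also assumes at least one chunk.

```javascript
const nodeCrypto = require("node:crypto");

// Hash the concatenation of the given byte buffers.
const sha256 = (...parts) => {
  const h = nodeCrypto.createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest(); // Buffer
};

// Tag leaves and interior nodes differently so a leaf cannot pose as a parent.
const leafHash = (chunk) => sha256(Buffer.from([0]), chunk);
const nodeHash = (left, right) => sha256(Buffer.from([1]), left, right);

// Build every level bottom-up; an unpaired node is carried up unchanged.
function buildLevels(chunks) {
  let level = chunks.map(leafHash);
  const levels = [level];
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length ? nodeHash(level[i], level[i + 1]) : level[i]);
    }
    levels.push(next);
    level = next;
  }
  return levels;
}

// Proof for leaf `index`: the sibling hash at each level, tagged with its side.
function proveChunk(levels, index) {
  const proof = [];
  for (let depth = 0; depth < levels.length - 1; depth++) {
    const sibling = index ^ 1;
    if (sibling < levels[depth].length) {
      proof.push({ hash: levels[depth][sibling], left: sibling < index });
    }
    index >>= 1;
  }
  return proof;
}

// Recompute the path from one chunk up to the root and compare.
function verifyChunk(chunk, proof, root) {
  let h = leafHash(chunk);
  for (const step of proof) {
    h = step.left ? nodeHash(step.hash, h) : nodeHash(h, step.hash);
  }
  return h.equals(root);
}

const sliceChunks = ["c0", "c1", "c2", "c3", "c4"].map((s) => Buffer.from(s));
const sliceLevels = buildLevels(sliceChunks);
const sliceRoot = sliceLevels[sliceLevels.length - 1][0];
const sliceProof = proveChunk(sliceLevels, 2);

console.log(verifyChunk(sliceChunks[2], sliceProof, sliceRoot)); // true
console.log(verifyChunk(Buffer.from("evil"), sliceProof, sliceRoot)); // false
```

The verifier needs only the chunk, a few sibling hashes (about log2 of the chunk count), and the trusted root, which is what makes checking a byte range possible without fetching the rest of the file.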
On the site: this is the same instinct behind the ?v=N immutable
image URLs and the cache-first service worker, namely identity by content,
not by name. If the photo originals ever needed incremental sync to R2,
content-defined chunking is how you'd ship only what changed.
Inspired by mkit's streaming demo and MakeChain: a protocol for cryptographically-signed, permissionless version control.