bytes & pixels
Starter draft. This grew out of the garage encoding study, where one photo runs through every encoder and the bytes get counted. The conversation is a first pass; the demo fetches the study's real output files and measures their actual sizes.
the garage page runs one photo through jpeg, jpegli, and avif and prints three different sizes. before I trust those numbers, tell me why one image even gives three sizes.
An encoder throws away the detail your eye skips, then packs what survives. The three differ in how hard they push each step.
JPEG (1992) cuts the photo into 8x8 blocks, runs a cosine transform, and rounds off the high-frequency parts. jpegli (Google, 2024) keeps the exact JPEG format but rounds smarter, so it fits roughly 25% more quality into the same bytes as the old mozjpeg encoder. AVIF borrows AV1's video tricks, larger blocks and sharper prediction, so it usually wins on bytes-per-pixel outright.
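To make the "cosine transform, then round" step concrete, here is a minimal sketch of it on a single 8x8 block. It is a toy under stated assumptions: the `toy_step` step sizes are made up for illustration (real JPEG encoders use full 8x8 quantization tables), and the naive loop is far slower than the fast DCTs that real encoders use.

```python
import math

N = 8  # JPEG block size


def dct_2d(block):
    """Naive 2-D DCT-II of an 8x8 block (list of 8 rows of 8 numbers)."""
    def alpha(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def toy_step(u, v):
    # Hypothetical step sizes: coarser rounding for higher frequencies.
    return 8 + 6 * (u + v)


def quantize(coeffs, step_for):
    """Divide each coefficient by its step size and round to an integer."""
    return [[round(coeffs[u][v] / step_for(u, v)) for v in range(N)]
            for u in range(N)]


# A smooth left-to-right gradient, shifted by 128 so values centre on zero.
block = [[(8 + 16 * col) - 128 for col in range(N)] for _ in range(N)]
quantized = quantize(dct_2d(block), toy_step)
nonzero = sum(1 for row in quantized for c in row if c != 0)
print(f"{nonzero} of 64 coefficients survive rounding")
```

On smooth content like this gradient, almost every coefficient rounds to zero, and those runs of zeros are what the packing step compresses well.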
so jpegli writes a normal .jpg that any browser opens, just smaller?
Right. jpegli stays inside the JPEG bitstream, so a browser from 1995 still decodes it. AVIF needs a modern decoder (Safari 16+, Chrome 85+). That trade, universal but larger against modern but smaller, is why the photo grid ships both and lets the browser pick.
ok, stop describing it. show me the bytes on one photo.
The same 400×266 photo (106,400 pixels), run through every encoder. These sizes are fetched live from the files the garage study actually produced, so the bytes are real. The last column is size against the lossless PNG.

| encoder | size | b/px | vs PNG |
|---|---|---|---|
| measuring real files… | | | |
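The last two columns are plain ratios, so they are easy to check by hand. The sketch below assumes b/px means bytes per pixel (it also reports bits per pixel in case the column is read that way). The byte counts in the example are hypothetical placeholders, not the study's measured sizes, and `size_row` is a name invented here.

```python
def size_row(name, size_bytes, width, height, png_bytes):
    """Build one table row: raw size, per-pixel cost, and size relative to the PNG."""
    pixels = width * height
    return {
        "encoder": name,
        "size": size_bytes,
        "bytes_per_pixel": size_bytes / pixels,
        "bits_per_pixel": 8 * size_bytes / pixels,
        "vs_png": size_bytes / png_bytes,
    }


# Hypothetical numbers for illustration only; NOT the garage study's real output.
example = size_row("example.jpg", 21_280, 400, 266, 212_800)
print(f"{example['bytes_per_pixel']:.2f} B/px, {example['vs_png']:.0%} of PNG")
```

To measure a real file, `os.path.getsize(path)` from the standard library returns the byte count to pass in as `size_bytes`.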
so on this photo jpegli is about half the baseline JPEG, and avif is smaller still. what's the catch with avif then?
Decode support and encode time. AVIF leans on the AV1 codec, which is slower to encode and only decodes on recent browsers. jpegli is fast and universal. That is why the site builds jpegli from source for the universal fallback and uses AVIF as the primary <picture> source: the browser reaches for AVIF, and anything that cannot decode it falls back to the jpegli .jpg.
and the grayscale shots? why are those so much smaller?
A color image carries one luma plane (brightness) plus two chroma planes (color). Drop to grayscale and you delete both chroma planes, two of the three planes and all of the color information, so the file falls hard. The garage study has the side-by-side counts.
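The luma/chroma split can be sketched with the full-range BT.601 conversion that JFIF JPEG files use. This is an illustration of the plane split only; AVIF files can signal other matrices, and the function name is invented here.

```python
def clamp8(value):
    """Round to the nearest integer and clamp into the 0..255 sample range."""
    return max(0, min(255, round(value)))


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 RGB -> YCbCr, the convention used by JFIF JPEG."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp8(y), clamp8(cb), clamp8(cr)


print(rgb_to_ycbcr(100, 100, 100))  # gray pixel -> (100, 128, 128)
print(rgb_to_ycbcr(255, 0, 0))      # pure red   -> (76, 85, 255)
```

A gray pixel lands on the neutral chroma value 128 in both color planes, which is why a grayscale image has nothing worth storing there.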
so color is two of the three planes. JPEG and AVIF don't even keep color at full resolution, right? that's the subsampling thing?
Right, chroma subsampling. Your eye resolves brightness far better than color, so codecs keep luma at full resolution and shrink the two chroma planes. 4:4:4 keeps everything; 4:2:2 halves chroma horizontally; 4:2:0, the web default, stores one chroma sample per 2x2 block, a quarter of the color resolution. The luma carries the sharpness, so you barely notice.
The top band is fine luma detail (black and white lines); the bottom is fine chroma detail (red and green lines) at the same spacing. Switch the mode: luma stays razor sharp at every setting, while the color detail dissolves as you drop chroma resolution.
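The stripe effect can be reproduced in a few lines. This is a minimal sketch, assuming a box-filter downsample and a nearest-neighbour upsample (real encoders and decoders may use other filters). The sample values are illustrative, and both dimensions are assumed to be even.

```python
def downsample_2x2(plane):
    """Average each 2x2 block into one sample (simple box filter)."""
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample_2x2(plane):
    """Repeat each sample into a 2x2 block (nearest-neighbour)."""
    out = []
    for row in plane:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def stripes(width, height, a, b):
    """One-pixel vertical stripes alternating between values a and b."""
    return [[a if x % 2 == 0 else b for x in range(width)]
            for _ in range(height)]


luma = stripes(8, 4, 0, 255)     # black/white lines, kept at full resolution
chroma = stripes(8, 4, 240, 16)  # a strong swing in one chroma plane

chroma_420 = upsample_2x2(downsample_2x2(chroma))
print(luma[0])        # [0, 255, 0, 255, 0, 255, 0, 255]
print(chroma_420[0])  # [128.0, 128.0, 128.0, 128.0, 128.0, 128.0, 128.0, 128.0]
```

Every 2x2 block holds one stripe of each color, so the average is the neutral midpoint and the chroma stripes vanish, while the luma plane is never touched.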
wild, the black-and-white lines stay razor sharp at 4:2:0 but the red-green lines just dissolve. and that's half the samples gone.
Exactly, and for photographs it is almost free, because real scenes rarely put fine high-contrast color edges right next to each other. Where it shows is red text on a dark background or saturated line art, which is why screenshots and graphics keep 4:4:4 while photos ship 4:2:0.
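The "half the samples" figure follows from counting samples per plane. A small sketch, using the 400×266 photo from earlier; the `sample_count` helper is a name invented for this example.

```python
# Horizontal and vertical divisors applied to each chroma plane.
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def sample_count(width, height, mode):
    """Total stored samples: one full luma plane plus two reduced chroma planes."""
    h_div, v_div = SUBSAMPLING[mode]
    chroma_w = -(-width // h_div)   # ceiling division covers odd sizes
    chroma_h = -(-height // v_div)
    return width * height + 2 * chroma_w * chroma_h


for mode in SUBSAMPLING:
    print(mode, sample_count(400, 266, mode))
# 4:4:4 319200
# 4:2:2 212800
# 4:2:0 159600
```

159,600 is exactly half of 319,200. This counts the raw samples handed to the encoder, not file bytes, so the saving in the final file is usually smaller.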
earlier you said jpegli just "rounds smarter." smarter how, if it writes the same JPEG format?
It models your eye instead of trusting the fixed 1992 quantization tables. jpegli can work in the XYB color space, which spaces colors the way you perceive them; it sets the quantization adaptively per block by what you would actually notice, and it rounds with adaptive dead-zones. The output is still a standard JPEG that any decoder reads, but more bits land where your eye looks and fewer on detail you would never see. That is where the roughly 25% gain over the old mozjpeg encoder comes from.
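A dead-zone is easiest to see next to plain rounding. The sketch below is a toy illustration of the general idea, not jpegli's actual algorithm: the `dead_zone` threshold, the step size, and the coefficient values are all assumptions made up for this example. An adaptive encoder would pick the threshold per block instead of fixing it.

```python
import math


def quantize_plain(coeff, step):
    """Ordinary quantization: divide by the step and round to nearest."""
    return round(coeff / step)


def quantize_dead_zone(coeff, step, dead_zone=0.75):
    """Zero anything smaller than dead_zone * step; round the rest to nearest."""
    scaled = abs(coeff) / step
    if scaled < dead_zone:
        return 0
    return int(math.copysign(math.floor(scaled + 0.5), coeff))


coeffs = [52.0, -7.0, 6.0, -3.0, 11.0]  # hypothetical DCT coefficients
step = 10

plain = [quantize_plain(c, step) for c in coeffs]
widened = [quantize_dead_zone(c, step) for c in coeffs]
print(plain)    # [5, -1, 1, 0, 1]
print(widened)  # [5, 0, 0, 0, 1]
```

The two small coefficients that plain rounding keeps as ±1 become zeros, which cost almost nothing to store. The bet is that you would not have noticed them anyway.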
ok, I want to SEE these tradeoffs, not just read byte counts. show me one zoomed crop across formats and qualities.
Here is a 96-pixel detail crop, the car's body edge against the dark wheel, run through every encoder and blown up so the artifacts show. Start with format against quality.
One 96-pixel detail crop (the body edge against the dark wheel), run through three formats at three quality tiers and shown pixel-zoomed. Read across a row to compare formats at one quality; read down to watch a format fall apart as quality drops.

At 96 pixels the absolute byte counts mostly reflect each format's fixed container overhead (WebP's is lightest, AVIF's heaviest) rather than per-pixel efficiency. On full images the ranking flips, as the live measurement above shows. What the zoom reveals is the artifact style: JPEG breaks into 8x8 blocks, while WebP and AVIF smear and smudge instead.
jpeg goes blocky, the others go smeary. now the jpeg encoders you mentioned, mozjpeg vs the google one?
Same crop, same quality knob, three encoders. The bytes under each tell the story.
The same crop, the same quality setting (q72), three JPEG encoders. All three write the same JPEG format any decoder reads; the only difference is how cleverly each spends its bits. Watch the byte counts.

Same quality knob, real bytes measured live: jpegli comes out smallest, well under both mozjpeg and the system baseline, because it models your eye (XYB color, adaptive per-block quantization) and stops spending bits on detail you would not notice. That is the psychovisual win in one crop.
jpegli really is smaller at the same setting. and chroma, up close on a real edge?
The same crop as a JPEG at the three chroma samplings. Luma holds; the color is what softens.
Chroma subsampling on a real edge instead of stripes: the same crop saved as a JPEG at 4:4:4, 4:2:2, and 4:2:0. The luma edge holds at every setting; the color is what loses resolution.

The size drop from 4:4:4 to 4:2:0 is modest here because this crop is mostly grays. On saturated content the savings grow and the color fringing gets obvious; the stripes demo above shows the stark version.
so what does the site actually ship per thumbnail?
Each thumbnail is dual-encoded: an AVIF primary and a jpegli JPEG fallback inside one <picture>, plus a 400px AVIF tier for phones. The browser takes the first source it can decode, which is the smaller AVIF wherever it is supported, and never downloads the others.
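The selection rule inside a <picture> is simple enough to model. This sketch covers format matching only: it ignores the media and size conditions that would choose the 400px tier, and the file names are hypothetical because the site's real paths are not shown here.

```python
def pick_source(sources, fallback, supported_types):
    """Return the first source whose MIME type is supported, else the fallback."""
    for source in sources:
        if source["type"] in supported_types:
            return source["url"]
    return fallback


# Hypothetical file names for illustration.
sources = [{"type": "image/avif", "url": "thumb.avif"}]
fallback = "thumb.jpg"  # the jpegli output, an ordinary JPEG

modern = pick_source(sources, fallback, {"image/avif", "image/jpeg"})
legacy = pick_source(sources, fallback, {"image/jpeg"})
print(modern)  # thumb.avif
print(legacy)  # thumb.jpg
```

Order matters: the match goes to the first supported source, so the AVIF has to be listed ahead of the JPEG fallback.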
got it. one photo, several encoders, and the grid hands each browser the cheapest thing it can read. thanks.