How JPEG compression actually works, step by step

A walk through the JPEG encoder: color conversion, chroma subsampling, the discrete cosine transform, quantization, and Huffman coding.

May 24, 2026 · 10 min read · By Andy Feliciotti

What this post covers

The hero post on what JPEG artifacts are describes the four kinds of damage you see in compressed images. This one is the longer technical companion: what the encoder actually does to your image at every step, why each step is in the pipeline, and where the lossy behavior comes from.

If you only remember one thing: quantization is where nearly all of the loss happens. Chroma subsampling discards some color resolution, and everything else in JPEG is reversible up to rounding. The "quality" slider in your image editor scales a single table that controls how aggressive that quantization is.

Lossy vs lossless

A lossless compressor encodes a file so that decoding it returns the exact same bytes. PNG, FLAC, and ZIP are lossless. A lossy compressor allows the decoded output to differ from the input, in exchange for a much smaller file.

JPEG is lossy because the cost of perfect fidelity is enormous. A modest 4000×3000 photograph holds 12 million pixels, each with three 8-bit color channels, for 36 MB of raw data. A high-quality JPEG of the same image is around 4 MB. That is roughly a 9× reduction, achieved mostly by throwing away information your eye is least likely to miss, with lossless entropy coding shrinking what remains.
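The arithmetic behind those figures, as a quick check. The 4 MB value is the ballpark from the paragraph above, not a measurement, and megabytes are taken as decimal (10^6 bytes).

```python
width, height = 4000, 3000
channels = 3            # R, G, B
bytes_per_sample = 1    # 8 bits per channel

raw_bytes = width * height * channels * bytes_per_sample
raw_mb = raw_bytes / 1_000_000   # decimal megabytes
jpeg_mb = 4.0                    # ballpark figure from the text, an assumption
ratio = raw_mb / jpeg_mb

print(f"raw: {raw_mb:.0f} MB, reduction: {ratio:.0f}x")
```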

JPEG was standardized in 1992 as ISO/IEC 10918-1 (also published as ITU-T Recommendation T.81). The reference paper by Gregory K. Wallace, The JPEG Still Picture Compression Standard (CACM, 1991), is still the clearest summary of how the encoder works.

The full pipeline

A JPEG encoder runs the same six operations on every image, in this order:

Convert RGB to Y′CbCr.
Subsample the chroma channels.
Split each channel into 8×8 blocks.
Run a 2D discrete cosine transform on every block.
Divide each coefficient by a quantization table value, then round.
Encode the result with run-length and Huffman coding.

Steps 1, 3, 4, and 6 are reversible (up to rounding). Step 2 discards some chroma resolution, and step 5 is where most of the information is lost.

Step 1: RGB to Y′CbCr

JPEG separates an image into one luminance channel (Y′, brightness) and two chrominance channels (Cb and Cr, blue-difference and red-difference). The math is a simple linear transformation per pixel:

Y'  =  0.299·R + 0.587·G + 0.114·B
Cb  = -0.169·R - 0.331·G + 0.500·B + 128
Cr  =  0.500·R - 0.419·G - 0.081·B + 128

Human vision is much more sensitive to brightness than to color. A small error in Y′ is visible; the same error in Cb or Cr usually is not. Splitting the image lets the encoder treat the two kinds of information separately and spend more bits on the channel that matters most.

The conversion itself is reversible, apart from small rounding errors when the results are stored as 8-bit integers. RGB and Y′CbCr are equivalent representations of the same data.
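A minimal sketch of the per-pixel conversion, using the three-decimal coefficients quoted above. The function name `rgb_to_ycbcr` is made up for this example, and rounding then clamping to 0–255 is one reasonable choice, not the only one an encoder might make.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range Y'CbCr."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
    cr = 0.500 * r - 0.419 * g - 0.081 * b + 128
    # Round to integers and clamp to the 8-bit range.
    return tuple(min(255, max(0, round(v))) for v in (y, cb, cr))


print(rgb_to_ycbcr(255, 255, 255))  # white: full luma, neutral chroma
print(rgb_to_ycbcr(255, 0, 0))      # pure red: Cr saturates
```

Grays always land on Cb = Cr = 128, which is why the chroma planes of a low-saturation photo are so flat and compress so well.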

Step 2: chroma subsampling

This step exploits the brightness-over-color sensitivity directly: the encoder downsamples the chroma channels but keeps luminance at full resolution. The most common ratio is 4:2:0, which means for every 2×2 block of luma samples, only one Cb and one Cr sample is kept. The chroma channels end up at one quarter the spatial resolution of the luma channel.

JPEG also supports 4:2:2 (chroma halved horizontally only), 4:1:1 (chroma reduced by 4× horizontally), and 4:4:4 (no subsampling). Photoshop's Save for Web uses 4:4:4 above quality 50 and 4:2:0 at 50 and below. Most modern photo apps default to 4:2:0.

Subsampling is technically lossy because you cannot reconstruct the four original chroma samples from a single averaged one. In practice, the visible cost is minor on photographs and severe on saturated, high-contrast edges, which is why a red shirt against a blue sky tends to bleed at low quality settings.
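Here is a rough sketch of 4:2:0 subsampling on one chroma plane, assuming a plain 2×2 box average. Real encoders may use other downsampling filters. `subsample_420` is a name invented for this example, and it expects even dimensions.

```python
def subsample_420(plane):
    """Average each 2x2 group of a chroma plane (list of rows, even dimensions)."""
    height, width = len(plane), len(plane[0])
    out = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            row.append((total + 2) // 4)  # rounded integer average
        out.append(row)
    return out


cb = [
    [90, 90, 240, 240],
    [90, 90, 240, 240],
    [90, 240, 90, 240],   # a sharp vertical color edge runs through
    [90, 240, 90, 240],   # the bottom two 2x2 groups
]
print(subsample_420(cb))
```

The top two groups are uniform and survive intact. The bottom two straddle a color edge and collapse to an in-between value that was in neither original pixel, which is the bleeding described above.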

Step 3: split into 8×8 blocks

Every channel is sliced into 8×8 blocks of samples. When chroma is subsampled, the encoder groups blocks into MCUs (minimum coded units): with 4:2:0, one MCU covers a 16×16 pixel area and holds four luma blocks plus one Cb and one Cr block. Every block is transformed and quantized independently from then on. During those steps the encoder has no way to look at a block's neighbors.

This independence is the direct cause of blocking artifacts. At low quality settings, the rounding inside one block does not match the rounding in the next, and the seams become visible as a checkerboard pattern.

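If your image's dimensions are not a multiple of the block size (8, or the MCU size when chroma is subsampled), the encoder pads the right and bottom edges, typically by replicating the edge pixels. Those padding pixels exist in the encoded file, but the decoder crops them away using the dimensions recorded in the frame header.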
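A small sketch of padding and block splitting for a single plane, assuming edge replication as the padding strategy. The helper names `pad_plane` and `split_blocks` are made up for illustration.

```python
def pad_plane(plane, multiple=8):
    """Pad right and bottom edges by replicating the last column and row."""
    height, width = len(plane), len(plane[0])
    pad_w = (-width) % multiple
    pad_h = (-height) % multiple
    rows = [row + [row[-1]] * pad_w for row in plane]
    rows += [list(rows[-1]) for _ in range(pad_h)]
    return rows


def split_blocks(plane, size=8):
    """Return size x size blocks in raster order (left to right, top to bottom)."""
    height, width = len(plane), len(plane[0])
    return [
        [row[x:x + size] for row in plane[y:y + size]]
        for y in range(0, height, size)
        for x in range(0, width, size)
    ]


# A 10-row by 13-column plane pads up to 16 x 16, giving four blocks.
plane = [[r * 13 + c for c in range(13)] for r in range(10)]
padded = pad_plane(plane)
blocks = split_blocks(padded)
print(len(padded), len(padded[0]), len(blocks))
```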

Step 4: discrete cosine transform

This is the most mathematically interesting step. Each 8×8 block of pixel values is converted into an 8×8 block of DCT coefficients. The transform is invertible: applying its inverse to the coefficients gives back the original block, up to floating-point rounding.

Conceptually, the DCT expresses the block as a sum of 64 basis patterns: one flat (DC), and 63 wavy patterns of progressively higher spatial frequency. The coefficient at position (0, 0) is proportional to the average brightness of the block (the DC coefficient). The coefficient at (7, 7) is the contribution of the highest-frequency diagonal pattern.

For natural images, most of the energy concentrates in the low-frequency coefficients. The DC value plus a handful of low-frequency AC coefficients is usually enough to describe the block visually. The high-frequency coefficients are small and contribute mostly to fine texture.

The transform itself is lossless. It just rotates the same data into a basis where the next step can throw away whichever components matter least.
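To make the transform concrete, here is a direct (slow, O(N⁴)) sketch of the 8×8 DCT-II and its inverse with the normalization used in T.81. Real encoders use fast factorizations instead. JPEG also subtracts 128 from each sample before the transform, and this example assumes the input is already level-shifted.

```python
import math

N = 8


def _c(k):
    """Normalization factor: 1/sqrt(2) for the zero-frequency term, else 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def _basis(pos, freq):
    return math.cos((2 * pos + 1) * freq * math.pi / (2 * N))


def dct_2d(block):
    """Forward 8x8 DCT-II of a block of level-shifted samples."""
    return [
        [0.25 * _c(u) * _c(v) * sum(
            block[r][c] * _basis(r, u) * _basis(c, v)
            for r in range(N) for c in range(N))
         for v in range(N)]
        for u in range(N)
    ]


def idct_2d(coeffs):
    """Inverse transform: rebuild the samples from the 64 coefficients."""
    return [
        [0.25 * sum(
            _c(u) * _c(v) * coeffs[u][v] * _basis(r, u) * _basis(c, v)
            for u in range(N) for v in range(N))
         for c in range(N)]
        for r in range(N)
    ]


flat = [[50] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))  # DC term is 8 times the block average
```

With this normalization a flat block of value 50 has a DC coefficient of 400 and AC coefficients that are zero up to floating-point noise, so all of its energy sits in a single number.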

Step 5: quantization

This is the lossy step. Each coefficient is divided by a value from a fixed 8×8 quantization table, and the result is rounded to the nearest integer.

JPEG uses one quantization table for luminance and another for chrominance. The standard publishes example tables in Annex K of T.81. The table entries are larger at high-frequency positions, which means high-frequency coefficients get rounded more aggressively. After rounding, many of them are zero.

Here is the canonical Annex K luminance table:

 16  11  10  16  24  40  51  61
 12  12  14  19  26  58  60  55
 14  13  16  24  40  57  69  56
 14  17  22  29  51  87  80  62
 18  22  37  56  68 109 103  77
 24  35  55  64  81 104 113  92
 49  64  78  87 103 121 120 101
 72  92  95  98 112 100 103  99

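The smallest values sit in the top-left corner (16 at the DC position, 10 to 12 beside it), so the DC and lowest-frequency coefficients are divided by small numbers and survive largely intact. The largest values sit toward the bottom-right, so the highest-frequency coefficients get divided by roughly 100 before rounding, which usually rounds them to zero.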
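A minimal sketch of quantization and its inverse with the Annex K luminance table. The function names are made up for this example. Python's `round` breaks exact .5 ties toward the even integer, which can differ from a real encoder's tie handling.

```python
LUMA_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quantize(coeffs, table=LUMA_Q):
    """Divide each DCT coefficient by its table entry and round (the lossy step)."""
    return [[round(coeffs[u][v] / table[u][v]) for v in range(8)] for u in range(8)]


def dequantize(levels, table=LUMA_Q):
    """Decoder side: multiply back. The rounding error is not recoverable."""
    return [[levels[u][v] * table[u][v] for v in range(8)] for u in range(8)]


coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 400.0   # DC
coeffs[0][1] = -45.0   # low-frequency AC
coeffs[7][7] = 30.0    # highest-frequency AC

levels = quantize(coeffs)
restored = dequantize(levels)
print(levels[0][0], levels[0][1], levels[7][7])
print(restored[0][0], restored[0][1], restored[7][7])
```

The DC value comes back unchanged, the low-frequency value comes back as -44 instead of -45, and the high-frequency value of 30 is gone entirely because 30/99 rounds to zero.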

The "quality" slider in your image editor controls a single scaling factor applied to this table. At quality 100 in libjpeg, the table is scaled down until every entry is 1, so coefficients are only rounded to the nearest integer. At quality 10, the entries are multiplied by 5, and all but the lowest-frequency coefficients usually become zero.
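As one concrete example of that scaling, the sketch below follows the convention used by libjpeg's 0–100 quality setting, written from memory of its `jpeg_quality_scaling` logic and not copied from the source. Other tools map their sliders differently. `scale_table` is a name invented here, and the clamp to 255 assumes 8-bit baseline tables.

```python
def scale_table(base, quality):
    """Scale a quantization table libjpeg-style for a quality in 1..100."""
    quality = max(1, min(100, quality))
    # Percentage scale factor: 100 at quality 50, 0 at quality 100.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [
        [max(1, min(255, (value * scale + 50) // 100)) for value in row]
        for row in base
    ]


# First row of the Annex K luminance table, enough to show the effect.
base_row = [[16, 11, 10, 16, 24, 40, 51, 61]]

for q in (10, 50, 75, 100):
    print(q, scale_table(base_row, q)[0])
```

Quality 50 returns the table unchanged, quality 100 collapses every entry to 1, and quality 10 multiplies every entry by 5 (with a cap at 255).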

Two practical consequences:

High-frequency detail goes first. That is why fine textures and sharp edges suffer at low quality. The encoder is literally instructed to throw the high-frequency components away first.
There is no single "quality" scale. Photoshop maps its 0–12 slider to a quality factor, libjpeg uses 0–100, and Adobe Camera Raw uses 1–12. Different tools produce different quantization tables at the same numeric setting. The JPEG quality settings guide goes deeper on this.

Step 6: Huffman coding

After quantization, each 8×8 block is read in a zigzag order that starts at the DC coefficient and snakes outward toward the high-frequency corner. Because the high-frequency coefficients are mostly zero, the zigzag produces long runs of zeros that compress well under run-length encoding.
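The zigzag scan and the run-length step can be sketched as follows. The helper names are made up for this example. The output is a list of (zero_run, value) pairs, with (15, 0) standing for a run of sixteen zeros and (0, 0) for end-of-block, mirroring the ZRL and EOB symbols of the standard. The later stage that turns these pairs into Huffman codes is omitted.

```python
def zigzag_order(n=8):
    """Return (row, col) positions in JPEG zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col, one anti-diagonal at a time
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def run_length_ac(zz):
    """Encode the 63 AC coefficients of a zigzag-ordered block as (run, value)."""
    pairs, run = [], 0
    for value in zz[1:]:
        if value == 0:
            run += 1
            continue
        while run > 15:
            pairs.append((15, 0))  # ZRL: sixteen zeros in a row
            run -= 16
        pairs.append((run, value))
        run = 0
    if run:
        pairs.append((0, 0))       # EOB: everything left is zero
    return pairs


block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 25, -4, 3, 1

zz = [block[r][c] for r, c in zigzag_order()]
print(zz[:6])
print(run_length_ac(zz))
```

Sixty-three AC coefficients shrink to four symbols here, which is typical of a smooth block after quantization.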

The result is then Huffman coded, either with the example tables published in Annex K or with tables the encoder optimizes for the image. The DC coefficient is encoded as a difference from the previous block's DC value in the same channel (this is the only point where blocks talk to each other in the encoder). The AC coefficients use a separate Huffman table.
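The DC differencing is simple enough to show in a few lines. This sketch assumes the predictor starts at 0 for the first block and ignores restart markers. The "size category" is the number of bits needed for the magnitude of the difference, which is the part that actually gets a Huffman code. Both function names are illustrative.

```python
def dc_differences(dc_values):
    """Replace each block's DC level with its difference from the previous block."""
    diffs, previous = [], 0
    for dc in dc_values:
        diffs.append(dc - previous)
        previous = dc
    return diffs


def size_category(value):
    """Number of bits needed for |value|; 0 means the difference is zero."""
    return abs(value).bit_length()


dc_levels = [25, 27, 26, 26]   # neighboring blocks usually have similar brightness
diffs = dc_differences(dc_levels)
print(diffs)
print([size_category(d) for d in diffs])
```

Because adjacent blocks tend to have similar average brightness, the differences cluster near zero and fall into the small size categories that get the shortest codes.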

The entire entropy coding stage is lossless. It is just a smaller way to write the quantized integers down.

The file structure

A .jpg file is a sequence of segments introduced by 0xFF marker bytes. The important ones in order:

Marker	Meaning
`FFD8` (SOI)	Start of image
`FFE0`–`FFEF` (APPn)	Application-specific data (Exif, color profile, thumbnails)
`FFDB` (DQT)	Define quantization table
`FFC0` (SOF0)	Start of frame: dimensions and chroma layout
`FFC4` (DHT)	Define Huffman table
`FFDA` (SOS)	Start of scan: the actual encoded image data
`FFD9` (EOI)	End of image

If you open a JPEG in a hex editor, you can see this structure directly. The DQT segment contains the exact quantization tables the encoder used, which is why you can identify the source software of a JPEG with reasonable accuracy by inspecting them.
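The same walk can be done in a few lines of code. This is a minimal sketch, not a full parser: it lists segments up to the start of scan and stops there, since the entropy-coded data that follows needs different handling. `list_segments` is a made-up name, and the demo bytes are a synthetic header skeleton, not a decodable image.

```python
MARKER_NAMES = {0xD8: "SOI", 0xD9: "EOI", 0xDB: "DQT",
                0xC0: "SOF0", 0xC4: "DHT", 0xDA: "SOS"}


def list_segments(data):
    """Return (name, payload_length) for each segment up to and including SOS."""
    segments, pos = [], 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected marker at offset {pos}")
        code = data[pos + 1]
        if code == 0xFF:            # fill byte before a marker, skip it
            pos += 1
            continue
        if 0xE0 <= code <= 0xEF:
            name = f"APP{code - 0xE0}"
        else:
            name = MARKER_NAMES.get(code, f"0x{code:02X}")
        pos += 2
        if code in (0xD8, 0xD9):    # SOI and EOI carry no length field
            segments.append((name, 0))
            if code == 0xD9:
                break
            continue
        length = int.from_bytes(data[pos:pos + 2], "big")  # includes its own 2 bytes
        segments.append((name, length - 2))
        if code == 0xDA:            # compressed image data follows, stop here
            break
        pos += length
    return segments


demo = (b"\xFF\xD8"                                  # SOI
        + b"\xFF\xE0" + (4).to_bytes(2, "big") + b"JF"       # tiny APP0
        + b"\xFF\xDB" + (67).to_bytes(2, "big") + bytes(65)  # DQT: 1 id byte + 64 entries
        + b"\xFF\xDA" + (2).to_bytes(2, "big"))              # SOS with empty header
print(list_segments(demo))
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them in. The DQT payload for an 8-bit table is one precision/ID byte followed by 64 entries stored in zigzag order.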

Why "progressive" JPEGs exist

A standard ("baseline") JPEG stores its data in scan order: the file is read from top to bottom, and the decoder can only display lines that have arrived. A progressive JPEG reorders the same coefficients across multiple scans, lower-frequency first, so that an early partial download gives a blurry version of the whole image rather than a sharp top half.
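One way to picture the reordering is to split each block's zigzag-ordered coefficients into frequency bands, one band per scan. The sketch below shows only that idea. The band boundaries are arbitrary choices for illustration, not what any particular encoder uses, and real progressive files can also split coefficients by bit precision.

```python
def split_into_scans(zz_blocks, bands=((0, 0), (1, 5), (6, 63))):
    """Group zigzag-ordered blocks into scans, one per (first, last) index band."""
    return [[block[lo:hi + 1] for block in zz_blocks] for lo, hi in bands]


# Two stand-in blocks whose values are just their zigzag indices.
zz_blocks = [list(range(64)), list(range(64))]
scans = split_into_scans(zz_blocks)

for number, scan in enumerate(scans, start=1):
    print(f"scan {number}: {len(scan[0])} coefficients per block")
```

The first scan carries only the DC term of every block, so a decoder that has received just that scan can already paint a coarse version of the entire image.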

Progressive encoding does not change the quality. It is a different way to write the same quantized coefficients, which gives better perceived loading on slow connections. mozJPEG, Mozilla's modernized libjpeg fork, encodes progressively by default and is the reason "save for web" output got smaller without quality changes around 2014.

Putting it together

If you trace one pixel through the pipeline: a color value gets averaged with its three neighbors (subsampling), participates in a 64-coefficient transform (DCT), has each coefficient divided by a number from a fixed table and rounded (quantization), and is written down as part of a Huffman-coded zigzag scan. The reverse on decode reconstructs an approximation.

The lower your quality setting, the larger the numbers in step 5's table, and the more zeros you end up with in step 6. The file shrinks because the encoded data is mostly zeros, and your image gets the four kinds of damage we covered in the explainer.

FAQ

Why does the DCT use cosines and not sines?

The DCT-II (the variant JPEG uses) is closely related to the discrete Fourier transform but uses only cosines because it implicitly assumes the input is even-symmetric. That assumption gives the transform better energy compaction on natural images than a DFT would, with no complex numbers.

Is the DCT the same as the FFT?

No. The FFT is a fast algorithm for computing the discrete Fourier transform (DFT), and the DCT is a real-valued cousin of the DFT. JPEG uses the DCT specifically because, for natural images, it concentrates energy into fewer coefficients than the DFT does. The DCT also avoids the block-edge discontinuities that the DFT's implied periodic extension introduces.

Can JPEG support 12-bit or 16-bit images?

Baseline JPEG is 8-bit per channel. The standard also defines a 12-bit extended mode, but it has poor tool support and is rarely used. For higher bit depths, JPEG 2000, JPEG XL, AVIF, or HEIC are the right answers.

Why does the encoder use one quantization table for the whole image?

Because the table is encoded once in the file, instead of per-block. Custom encoders (mozJPEG, guetzli) experiment with per-block adjustment to reduce visible damage in important regions, but the baseline standard uses a single table per channel.

What is "JPEG 2000" and why does no one use it?

JPEG 2000 (released 2000) replaced the DCT with a wavelet transform and produces smaller files at the same visual quality. Patent uncertainty around the wavelet codec stalled adoption, and by the time the patents expired, WebP and AVIF had taken over the "next-gen image format" slot. JPEG 2000 is still common in medical imaging and digital cinema.

How does JPEG XL relate to JPEG?

JPEG XL was finalized as ISO/IEC 18181 in 2022. It includes a "JPEG transcoding" mode that re-encodes existing baseline JPEGs losslessly, producing files about 20% smaller that decode back to the original bytes. For new images it offers a more capable lossy/lossless codec with much better quality per byte. The format comparison post covers it in context.

Sources

Wallace, G. K. (1991). The JPEG Still Picture Compression Standard. Communications of the ACM, 34(4), 30–44.
ISO/IEC 10918-1:1994 / ITU-T Recommendation T.81, the JPEG standard. Annex K contains the example quantization and Huffman tables.
Independent JPEG Group (libjpeg), the reference open-source encoder.
mozJPEG, Mozilla's modernized libjpeg fork.
Alakuijala, J. et al. (2019). JPEG XL next-generation image compression architecture and coding tools. JPEG Committee white paper.