VOD Deep Dive Part 2: Video Codecs — Why a 4K Movie Fits in 5 GB

How video compression works, why H.264 still dominates, when to choose H.265 or AV1, per-title encoding, VMAF quality metrics, and hands-on ffmpeg examples.

zhuermu · 25 min
Tags: vod, streaming, codec, h264, h265, av1, ffmpeg

This is Part 2 of the VOD Streaming Deep Dive series.


Let’s Do the Math: How Big Is Uncompressed Video?

From Part 1, we know:

A 1080p, 30 fps, YUV 4:2:0, 8-bit uncompressed video stream:

Per-second size = 1920 × 1080 × 1.5 (YUV 4:2:0) × 30 ÷ 1024² ≈ 89 MB/sec

So:

  • 1 minute ≈ 5.2 GB
  • 90-minute movie ≈ 470 GB
  • 4K movie (4× the pixels) ≈ 1.8 TB
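
The arithmetic above is easy to reproduce. Here is a minimal Python sketch, assuming 8-bit samples with 4:2:0 subsampling (1.5 bytes per pixel); the function name `raw_bytes_per_second` is made up for this illustration.

```python
def raw_bytes_per_second(width, height, fps, bytes_per_pixel=1.5):
    """Uncompressed data rate; 1.5 bytes/pixel corresponds to 8-bit YUV 4:2:0."""
    return width * height * bytes_per_pixel * fps

MIB = 1024 ** 2
GIB = 1024 ** 3

rate_1080p = raw_bytes_per_second(1920, 1080, 30)
movie_1080p = rate_1080p * 90 * 60                       # 90-minute movie
movie_4k = raw_bytes_per_second(3840, 2160, 30) * 90 * 60

print(f"1080p30: {rate_1080p / MIB:.0f} MiB/s")
print(f"90-minute 1080p movie: {movie_1080p / GIB:.0f} GiB")
print(f"90-minute 4K movie: {movie_4k / GIB:.0f} GiB")
```

The totals shift by a few percent depending on whether a gigabyte is counted as 1000³ or 1024³ bytes, which is why rounded figures for the same video can differ slightly.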

A real Netflix 4K movie is 5–15 GB. That means:

Encoding compresses video to less than 1% of its raw size.

Not magic — decades of mathematics. Here’s how.


Encoding and Decoding: Two Sides of a Coin

Raw video (89 MB/s)          Compressed video (2 MB/s)        Display
    ┃                             ┃                            ┃
    ▼                             ▼                            ▼
 ┌──────┐    encode           ┌──────┐    decode            ┌──────┐
 │ Raw  │  ──────────────►   │ File │  ────────────────►   │ Play │
 │ file │    (x264/x265)     │      │   (player/hardware)  │      │
 └──────┘                    └──────┘                      └──────┘
  • Encoding: Compress a large file into a small one → slow, CPU/GPU-intensive
  • Decoding: Reconstruct the image from the compressed file → fast, phones have dedicated hardware

Encoder + Decoder = Codec (coder-decoder).

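Why is encoding so much slower? The encoder has to search a huge space of choices (block partitions, prediction modes, motion vectors) for a compact representation, and slower presets simply search more of that space. The decoder just follows the instructions the encoder wrote. It is like packing an oddly shaped item into a box versus opening the box.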


The Two Axes of Video Compression

Axis 1: Intra-frame Compression

Compress each image on its own — similar to JPEG:

  • The human eye is more sensitive to low-frequency information (large color blocks) than to high-frequency detail (noise, fine texture)
  • DCT (Discrete Cosine Transform) converts each block of pixels from the spatial domain to the frequency domain
  • High-frequency coefficients are quantized coarsely, so many of them round to zero and cost almost nothing to store

This produces I-frames — each independently decodable.
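
To make the transform-and-quantize step concrete, here is a small sketch in plain Python. It is an illustration under simplifying assumptions, not how a production encoder works: real codecs use fast integer transforms and frequency-dependent quantization, while this uses a naive floating-point DCT and a single step size of 16 chosen arbitrarily. The helper names are hypothetical.

```python
import math

N = 8  # block size used by JPEG; chosen here for illustration

def dct_2d(block):
    """Naive orthonormal 2-D DCT-II of an N×N block (slow but easy to read)."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

def quantize(coeffs, step):
    """Divide by the step size and round; small coefficients become 0."""
    return [[round(c / step) for c in row] for row in coeffs]

def count_nonzero(quantized_block):
    return sum(1 for row in quantized_block for c in row if c != 0)

# A smooth gradient, typical of sky or skin: mostly low-frequency content.
gradient = [[100 + 2 * x + 3 * y for y in range(N)] for x in range(N)]
quantized = quantize(dct_2d(gradient), step=16)
print(f"{count_nonzero(quantized)} of {N * N} coefficients survive quantization")
```

For smooth content only a handful of the 64 coefficients remain non-zero, and long runs of zeros are exactly what the later entropy-coding stage compresses well.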

Axis 2: Inter-frame Compression

Exploit the fact that adjacent frames are nearly identical — only store differences.

Frame N (I-frame): Complete image
┌─────────────┐
│   🚗         │
│  ___________ │  ← Stored in full
└─────────────┘

Frame N+1 (P-frame): Only the difference
"Move the car in frame N 15 pixels to the right"
─────────► A few bytes to describe

The technique: Motion Estimation + Motion Compensation:

  1. Divide the frame into blocks (16×16 macroblocks in H.264, which can be split into partitions as small as 4×4; newer codecs use larger, variable-size blocks)
  2. For each block, search the previous frame for the best match
  3. Record only the motion vector (how far it moved) + residual (tiny remaining difference)
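
The three steps above can be sketched as a brute-force block search. This is a toy version under stated assumptions: frames are plain lists of rows, the cost function is the sum of absolute differences (SAD), and the search checks every whole-pixel offset in a small window. Real encoders use much faster search patterns and sub-pixel precision. The function names are invented for the example.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in `cur` and a shifted block in `ref`."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )

def best_motion_vector(cur, ref, bx, by, size=8, search=4):
    """Exhaustive search in a ±`search` pixel window; returns ((dx, dy), cost)."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= bx + dx and bx + dx + size <= width
                    and 0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost

# Synthetic 32×32 frames: the current frame is the reference shifted 3 px to the right.
W = H = 32
ref = [[(7 * x + 13 * y) % 256 for x in range(W)] for y in range(H)]
cur = [[ref[y][x - 3] if x >= 3 else 0 for x in range(W)] for y in range(H)]

mv, residual_cost = best_motion_vector(cur, ref, bx=12, by=12)
print(mv, residual_cost)  # the match lies 3 px to the left in the reference, with zero residual
```

The encoder then stores the vector plus the (here empty) residual instead of the 64 pixel values of the block.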

For typical content, inter-frame prediction saves far more bits than intra-frame compression alone, which is why a video file is so much smaller than the same frames stored as a JPEG image sequence.


Additional Codec Techniques (Know They Exist, Don’t Memorize)

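Technique                   | What it does                                                      | Effect
----------------------------|-------------------------------------------------------------------|--------------------------------------
Transform                   | DCT / integer transform: pixels → frequency coefficients          | Prepares data for quantization
Quantization                | Divide coefficients by a step size, then round                    | Main quality/compression knob
Entropy coding              | CABAC / CAVLC: use fewer bits for common symbols                  | Lossless final squeeze
In-loop filter (deblocking) | Smooths the visible edges between blocks                          | Smoother image
SAO (H.265+) / ALF (H.266)  | Sample adaptive offset / adaptive loop filter                     | Reduces edge artifacts
Multi-reference             | P/B frames can reference several frames (B-frames also later ones) | Better prediction → smaller residuals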

In practice, you control these through encoder parameters — no need to implement them yourself.


The Five Major Codecs

H.264 / AVC (2003): The Universal Gold Standard

  • Compatibility: virtually every device that plays video supports it
  • Compression: our baseline for comparison
  • Patents: paid (MPEG LA pool), but industry default
  • Best for: maximum compatibility, constrained compute

H.264 is 20+ years old and still the fallback codec for YouTube, Facebook, and Zoom.

H.265 / HEVC (2013): Better Compression, Licensing Nightmare

  • Compression: ~37% bitrate savings vs H.264 at equivalent quality
  • Patents: chaotic and expensive — three patent pools (MPEG LA, HEVC Advance, Velos Media) + many un-pooled patents
  • Hardware decode: iPhone 7+ (2016), Android 6+ flagships, most 4K TVs
  • Best for: Apple ecosystem, 4K streaming, bandwidth-sensitive scenarios

The licensing mess is why HEVC adoption on the web was so slow. Chrome and Firefox both resisted adding support.

VP9 (2013): Google’s Free Alternative

  • By: Google (acquired On2 Technologies)
  • Compression: close to H.265
  • Patents: royalty-free
  • Primary use: YouTube, Google Meet
  • Caveat: VP9 support on Apple platforms has historically been limited, so do not rely on it for iOS

AV1 (2018): The Royalty-Free Next Generation

  • By: Alliance for Open Media (AOMedia), whose members include Google, Netflix, Meta, Amazon, Cisco, Microsoft, Intel, Apple, and more
  • Compression: ~53% savings vs H.264, ~25% better than H.265
  • Patents: royalty-free
  • Hardware decode: iPhone 15 Pro+ (2023), Pixel 6+, Snapdragon 8 Gen 2+, Intel Arc, NVIDIA RTX 30+ (RTX 40+ adds hardware encode)
  • Encoding speed: much improved since the early encoders, largely thanks to SVT-AV1; CPU encoding is still ~2–5× slower than H.265

Netflix, YouTube, TikTok, and Meta are all moving toward AV1 as the primary codec.

H.266 / VVC (2020): Latest Generation, Not Yet Mainstream

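  • Compression: ~78% savings vs H.264 is the headline figure, while the ~25–30% gain over AV1 that is also quoted would imply closer to 65% (0.47 × 0.7 ≈ 0.33 of the H.264 bitrate), so treat both numbers as test-dependent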
  • Hardware: flagships starting in 2024; consumer coverage still low
  • Status: wait and see

Comparison Table

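Codec            | H.264     | H.265        | VP9        | AV1     | H.266
-----------------|-----------|--------------|------------|---------|--------
Year             | 2003      | 2013         | 2013       | 2018    | 2020
Savings vs H.264 | baseline  | 37%          | ~30%       | 53%     | 78%
Encode speed     | Fastest   | Mid          | Mid        | Slow    | Slowest
Decode load      | Lightest  | Mid          | Mid        | Higher  | Heavy
Compatibility    | Universal | Very good    | Good (web) | Growing | Low
Royalties        | Paid      | High + messy | Free       | Free    | Paid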

These percentages come from specific test sets and conditions (BD-rate with VMAF/PSNR). Real-world results vary significantly by content type (animation vs. live action vs. screen recording). Don’t use them as absolute marketing claims.
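
One pitfall when reading such tables is that savings percentages do not subtract. A short sketch of the conversion, using the figures quoted in the comparison table (the function name is made up for the example):

```python
def relative_savings(savings_a, savings_b):
    """Bitrate savings of codec A relative to codec B, given each one's savings vs a common baseline."""
    return 1 - (1 - savings_a) / (1 - savings_b)

# Savings vs H.264 as quoted in the comparison table.
vs_h264 = {"H.265": 0.37, "VP9": 0.30, "AV1": 0.53, "H.266": 0.78}

print(f"AV1 vs H.265: {relative_savings(vs_h264['AV1'], vs_h264['H.265']):.0%}")
print(f"H.266 vs AV1: {relative_savings(vs_h264['H.266'], vs_h264['AV1']):.0%}")
```

AV1 at 53% versus H.265 at 37% is about a 25% improvement, not 16%, because each percentage is taken relative to a different bitrate. Running the same formula on other pairs is a quick consistency check on any set of headline numbers.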


How to Choose a Codec in Practice

Where do your users watch?

├── ① Web browsers + all phones + legacy set-top boxes
│     → Must have H.264 (fallback)
│     → Add H.265 (iOS) + AV1 (Android flagships / modern Chrome) for bandwidth savings

├── ② Native app only (iOS + Android + optional TV)
│     → H.264 + H.265 as primary; AV1 gradual rollout by device capability

├── ③ Web-first, global bandwidth cost matters (YouTube/Netflix scale)
│     → AV1 primary + H.264 fallback

└── ④ 4K / HDR premium content
      → H.265 / AV1 + Dolby Vision
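
At playback time, the decision tree above usually reduces to "serve the most efficient codec this device can decode". A minimal sketch of that rule, assuming the player reports a set of decodable codecs; how that set is obtained is platform-specific and outside this sketch, and the names here are hypothetical.

```python
# Preference order: most bandwidth-efficient first, H.264 as the universal fallback.
PREFERENCE = ("av1", "h265", "h264")

def pick_codec(device_supports, available_renditions=PREFERENCE):
    """Return the first codec in preference order that the device can decode and that was encoded."""
    for codec in PREFERENCE:
        if codec in device_supports and codec in available_renditions:
            return codec
    return "h264"  # assumption: every title always has an H.264 rendition

print(pick_codec({"h264", "h265"}))                                  # iPhone-like device
print(pick_codec({"h264", "h265", "av1"}))                           # recent flagship
print(pick_codec({"h264", "av1"}, available_renditions=("h264", "h265")))  # title has no AV1 rendition yet
```

Passing `available_renditions` per title is what makes a gradual AV1 rollout possible: devices pick it up only for titles that have already been re-encoded.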

A Typical Short-Form Video Encoding Ladder

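Tier | Resolution | H.264 bitrate | H.265 bitrate | AV1 bitrate
-----|------------|---------------|---------------|------------
Low  | 360p       | 500 kbps      | 350 kbps      | 250 kbps
Mid  | 540p       | 900 kbps      | 650 kbps      | 480 kbps
Main | 720p       | 1.5 Mbps      | 1.0 Mbps      | 750 kbps
High | 1080p      | 3.5 Mbps      | 2.2 Mbps      | 1.6 Mbps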
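
Bitrates translate directly into storage and delivery cost. The sketch below converts the ladder into file sizes for a clip; the 60-second length is an assumed example value, and audio is ignored.

```python
LADDER_KBPS = {  # video bitrates from the ladder table
    "360p":  {"h264": 500,  "h265": 350,  "av1": 250},
    "540p":  {"h264": 900,  "h265": 650,  "av1": 480},
    "720p":  {"h264": 1500, "h265": 1000, "av1": 750},
    "1080p": {"h264": 3500, "h265": 2200, "av1": 1600},
}

def size_megabytes(bitrate_kbps, seconds):
    """File size in MB (10^6 bytes) at a given average bitrate, video only."""
    return bitrate_kbps * 1000 * seconds / 8 / 1_000_000

clip_seconds = 60  # assumed clip length for the example
for tier, rates in LADDER_KBPS.items():
    sizes = ", ".join(
        f"{codec} {size_megabytes(kbps, clip_seconds):.1f} MB"
        for codec, kbps in rates.items()
    )
    print(f"{tier}: {sizes}")
```

Multiplying the per-clip difference by the number of plays gives a rough feel for how much delivery traffic a more efficient codec removes.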

Hands-On: Compress a Video with ffmpeg

Basic H.264

ffmpeg -i input.mov -c:v libx264 output.mp4

CRF Quality Control

ffmpeg -i input.mov \
  -c:v libx264 -preset medium -crf 23 \
  -c:a aac -b:a 128k \
  output.mp4
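
Parameter          | Meaning
-------------------|--------------------------------------------------------------------------------------------------
-preset medium     | Speed/compression trade-off. Options: ultrafast → veryslow. Slower = smaller file at same quality
-crf 23            | Quality target, 0–51. Lower = better quality, larger file. Default is 23.
-c:a aac -b:a 128k | Audio: AAC at 128 kbps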

CRF quick reference: 18 = visually lossless, 23 = high quality, 28 = acceptable (visible compression).

Stream-Ready VOD Encoding

ffmpeg -i input.mov \
  -c:v libx264 -preset slow -crf 22 \
  -profile:v high -level 4.0 \
  -g 60 -keyint_min 60 -sc_threshold 0 \
  -c:a aac -b:a 128k \
  -movflags +faststart \
  output.mp4

Parameter                  | Why
---------------------------|------------------------------------------------------------------------------------------------
-g 60 -keyint_min 60       | One I-frame every 60 frames. At 30 fps = 2-second GOP, aligns with segmentation.
-sc_threshold 0            | Disable scene-cut auto I-frame insertion. Ensures all bitrate tiers have I-frames at the same positions.
-movflags +faststart       | Move the MP4 “table of contents” (moov box) to the start of the file, which enables progressive playback. See Part 4.
-profile:v high -level 4.0 | Compatibility: Level 4.0 supports up to 1080p30.
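
The GOP length has to follow the frame rate, so hard-coding 60 only works for 30 fps sources. A small helper sketch that derives the keyframe flags used in the command above; the helper names and the 6-second and 5-second segment lengths are assumptions for illustration.

```python
def gop_frames(fps, gop_seconds):
    """Number of frames between keyframes for a fixed-length GOP."""
    frames = fps * gop_seconds
    if frames != int(frames):
        raise ValueError("GOP length must be a whole number of frames")
    return int(frames)

def segment_is_aligned(segment_seconds, gop_seconds):
    """A segment can only start on a keyframe, so it must span a whole number of GOPs."""
    ratio = segment_seconds / gop_seconds
    return abs(ratio - round(ratio)) < 1e-9

def keyframe_args(fps, gop_seconds):
    """ffmpeg/libx264 arguments for fixed keyframe spacing (same flags as the command above)."""
    g = str(gop_frames(fps, gop_seconds))
    return ["-g", g, "-keyint_min", g, "-sc_threshold", "0"]

print(keyframe_args(30, 2))
print(keyframe_args(25, 2))
print(segment_is_aligned(6, 2), segment_is_aligned(5, 2))
```

With a 2-second GOP, 6-second segments line up with keyframes while 5-second segments do not, which is the alignment the `-g` and `-keyint_min` row refers to.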

H.265 Encoding

ffmpeg -i input.mov \
  -c:v libx265 -preset medium -crf 26 \
  -tag:v hvc1 \
  -c:a aac -b:a 128k \
  -movflags +faststart \
  output_hevc.mp4

Note: H.265 CRF values need to be ~3–5 higher than H.264 for equivalent visual quality (CRF 26 ≈ H.264 CRF 22). The -tag:v hvc1 tag is required for Apple devices to recognize the file.
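
The article gives commands for H.264 and H.265 but not for AV1. As a rough counterpart, the sketch below builds an ffmpeg command line for SVT-AV1 through ffmpeg's `libsvtav1` encoder. The CRF of 30 is an assumed starting point rather than a tuned value, this is not necessarily the command behind any numbers in this article, and it needs an ffmpeg build that includes libsvtav1.

```python
import shlex

def av1_command(src, dst, preset=8, crf=30):
    """Build an ffmpeg command line that encodes video with SVT-AV1 and audio with AAC."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libsvtav1", "-preset", str(preset), "-crf", str(crf),
        "-c:a", "aac", "-b:a", "128k",
        "-movflags", "+faststart",
        dst,
    ]

cmd = av1_command("input.mov", "output_av1.mp4")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

SVT-AV1 presets are numeric, and lower numbers are slower and compress better, so preset 8 sits toward the fast end. CRF scales differ between encoders, so the value should be chosen by comparing quality rather than copied from x264 or x265 settings.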

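Compression Results (1-min 4K source, ~2 GB)

Encode                 | Output size | Share of source
-----------------------|-------------|----------------
Source                 | ~2 GB       | 100%
H.264 CRF 23           | ~30 MB      | 1.5%
H.264 CRF 18           | ~80 MB      | 4%
H.265 CRF 26           | ~18 MB      | 0.9%
AV1 (SVT-AV1 preset 8) | ~12 MB      | 0.6%

Note: by the formula at the top of this article, one minute of truly uncompressed 4K at 30 fps would be roughly 21 GB, so a ~2 GB source cannot be raw 8-bit 4:2:0 video at that resolution and frame rate. The shares above are relative to the ~2 GB file.

Relative to that source, the outputs are roughly 25–170× smaller, and they are virtually indistinguishable on a phone screen.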


Per-Title and Per-Shot Encoding

The default approach is a fixed bitrate ladder for all content. But:

  • A cartoon (flat colors, little detail) looks great at 1080p@1000k
  • A concert (flashing lights, fast motion) still shows compression at 1080p@5000k

Per-Title Encoding (Netflix, 2015): calculate the optimal bitrate ladder for each title based on its visual complexity.

Traditional: Same ladder for every movie
  360p@500k / 720p@1500k / 1080p@4000k

Per-Title: Custom ladder per movie
  Cartoon:  360p@300k / 720p@800k / 1080p@1800k   (saves money)
  Concert:  360p@700k / 720p@2200k / 1080p@5500k  (needs more bits)

Per-Shot Encoding (Netflix, 2018) goes further: split a movie at scene cuts, then optimize each shot independently using VMAF as the quality target. Netflix claims 17% additional savings at equal quality.
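
The core idea can be sketched in a few lines: run several trial encodes, measure quality, and keep the cheapest one that reaches the target. This is a deliberate simplification (production systems also search across resolutions), and the VMAF values below are hypothetical numbers made up to match the cartoon and concert example, not measurements.

```python
def pick_bitrate(trial_encodes, target_vmaf):
    """Return the lowest bitrate whose measured VMAF meets the target, or the highest bitrate tried."""
    for kbps, vmaf in sorted(trial_encodes):
        if vmaf >= target_vmaf:
            return kbps
    return max(kbps for kbps, _ in trial_encodes)

# Hypothetical 1080p trial encodes as (bitrate in kbps, measured VMAF). Not real measurements.
TRIALS = {
    "cartoon": [(1000, 90.0), (1800, 94.0), (4000, 97.0), (5500, 98.0)],
    "concert": [(1000, 62.0), (1800, 74.0), (4000, 88.0), (5500, 93.5)],
}

for title, trials in TRIALS.items():
    print(title, pick_bitrate(trials, target_vmaf=93), "kbps")
```

Per-shot encoding applies the same selection to each shot instead of the whole title, which is where the extra savings come from: easy shots stop subsidizing hard ones.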

For most platforms, cloud providers’ “smart transcoding” templates (AWS MediaConvert QVBR, Alibaba Cloud “Narrowband HD”) deliver most of the benefit without building this yourself.


Hardware vs. Software Encoding

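Approach            | Implementation                             | Speed        | Quality        | Best for
--------------------|--------------------------------------------|--------------|----------------|-------------------------
Software (CPU)      | libx264 / libx265 / SVT-AV1                | Slow         | Best           | VOD offline transcoding
Hardware (GPU/ASIC) | NVIDIA NVENC, Intel QSV, Apple VideoToolbox | 5–50× faster | Slightly lower | Live streaming, real-time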

VOD should use software encoding: you encode once, but the bandwidth savings apply to every playback. Live streaming has to encode in real time, since you can’t spend 10 seconds encoding 1 second of video, so it typically relies on hardware encoders or fast software presets.


VMAF, PSNR, SSIM: Measuring Visual Quality

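Metric | Full name                              | Method                        | Range                     | Correlation with human perception
-------|----------------------------------------|-------------------------------|---------------------------|----------------------------------
PSNR   | Peak Signal-to-Noise Ratio             | Pixel-level difference        | 0–∞ dB (higher = better)  | Weak
SSIM   | Structural Similarity                  | Luminance/contrast/structure  | 0–1 (higher = better)     | Medium
VMAF   | Video Multi-Method Assessment Fusion   | ML fusion of multiple features | 0–100 (higher = better)   | Strong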

VMAF (open-sourced by Netflix) is the industry standard for perceptual quality:

  • VMAF ≥ 93: Visually lossless
  • VMAF ≈ 80: High quality
  • VMAF ≈ 60: Acceptable
  • VMAF ≤ 40: Obvious compression artifacts
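
For comparison, PSNR, the simplest metric in the table above, fits in a few lines. The sketch below assumes 8-bit frames given as lists of rows and uses the standard definition, 10·log10(peak² / MSE). The test frames are synthetic.

```python
import math

def psnr(reference, distorted, peak=255):
    """PSNR in dB between two equally sized 8-bit frames given as lists of rows."""
    squared_error = 0
    count = 0
    for ref_row, dist_row in zip(reference, distorted):
        for r, d in zip(ref_row, dist_row):
            squared_error += (r - d) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)

# Synthetic 16×16 frame and a copy in which every pixel is off by one level.
reference = [[(3 * x + 5 * y) % 256 for x in range(16)] for y in range(16)]
off_by_one = [[p + 1 for p in row] for row in reference]

print(f"{psnr(reference, off_by_one):.2f} dB")
```

Every pixel error counts the same here no matter where it sits or how visible it is, which is the main reason PSNR tracks human perception only weakly. VMAF itself is not reimplemented in this sketch; it is computed with Netflix's open-source libvmaf library.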

Key Takeaways

  1. Video compression achieves <1% of raw size through intra-frame (compress each image) + inter-frame (store only differences) compression.
  2. Encoding is slow, decoding is fast. Encoder + Decoder = Codec.
  3. H.264 is the universal fallback. H.265 saves 37% but has messy licensing. AV1 is free and saves 53%. VVC is the future but not ready yet.
  4. VOD transcoding essentials: CRF quality control, GOP alignment, faststart, disable scene-cut.
  5. Per-title / per-shot encoding is an advanced optimization; cloud “smart transcoding” covers most gains.
  6. VMAF is the industry-standard quality metric.

Previous: Part 1: Video Fundamentals

Next: Part 3: Audio Fundamentals