VOD Deep Dive Part 2: Video Codecs — Why a 4K Movie Fits in 5 GB
How video compression works, why H.264 still dominates, when to choose H.265 or AV1, per-title encoding, VMAF quality metrics, and hands-on ffmpeg examples.
This is Part 2 of the VOD Streaming Deep Dive series.
Let’s Do the Math: How Big Is Uncompressed Video?
From Part 1, we know:
A 1080p, 30 fps, YUV 4:2:0, 8-bit uncompressed video stream:
Per-second size = 1920 × 1080 × 1.5 (YUV 4:2:0) × 30 ÷ 1024² ≈ 89 MB/sec
So:
- 1 minute ≈ 5.2 GB
- 90-minute movie ≈ 470 GB
- 4K movie (4× the pixels) ≈ 1.8 TB
A real Netflix 4K movie is 5–15 GB. That means:
Encoding compresses video to less than 1% of its raw size.
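The arithmetic above is easy to reproduce. The following is a minimal sketch, assuming the same binary units as the formula (1 MB = 1024² bytes) and 8-bit samples; the function name `raw_bytes_per_second` is made up for this example.

```python
# Raw (uncompressed) video size, following the formula in the text.
# Assumption: MB/GB/TB are binary units (powers of 1024), as the "÷ 1024²" implies.

MIB, GIB, TIB = 1024**2, 1024**3, 1024**4


def raw_bytes_per_second(width, height, fps, bytes_per_pixel=1.5):
    """Bytes per second of 8-bit video; 1.5 bytes/pixel corresponds to YUV 4:2:0."""
    return width * height * bytes_per_pixel * fps


rate_1080p = raw_bytes_per_second(1920, 1080, 30)
movie_1080p = rate_1080p * 90 * 60   # 90-minute movie
movie_4k = movie_1080p * 4           # 4K has 4x the pixels of 1080p

print(f"1080p30 raw rate : {rate_1080p / MIB:.0f} MB/s")
print(f"1 minute         : {rate_1080p * 60 / GIB:.1f} GB")
print(f"90-minute movie  : {movie_1080p / GIB:.0f} GB")
print(f"90-minute 4K     : {movie_4k / TIB:.1f} TB")

# A 15 GB streaming file as a fraction of the raw 4K size.
ratio = 15 * GIB / movie_4k
print(f"15 GB / raw 4K   : {ratio:.2%}")
```

Even the largest file in the 5–15 GB range comes out below 1% of the raw size.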
Not magic — decades of mathematics. Here’s how.
Encoding and Decoding: Two Sides of a Coin
Raw video (89 MB/s)  Compressed video (2 MB/s)      Display
   ┃                        ┃                          ┃
   ▼                        ▼                          ▼
┌──────┐     encode      ┌──────┐      decode       ┌──────┐
│ Raw  │ ──────────────► │ File │ ────────────────► │ Play │
│ file │   (x264/x265)   │      │ (player/hardware) │      │
└──────┘                 └──────┘                   └──────┘
- Encoding: Compress a large file into a small one → slow, CPU/GPU-intensive
- Decoding: Reconstruct the image from the compressed file → fast, phones have dedicated hardware
Encoder + Decoder = Codec (coder-decoder).
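Why is encoding so much slower? The encoder has to search a huge space of candidate predictions, block partitions, and quantization choices to find a good trade-off, while the decoder just follows the instructions written into the bitstream. It is like packing an oddly shaped item into a box versus opening the box.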
The Two Axes of Video Compression
Axis 1: Intra-frame Compression
Compress each image on its own — similar to JPEG:
- The human eye is sensitive to low-frequency information (large color blocks) but not high-frequency detail (noise, fine edges)
- DCT (Discrete Cosine Transform) converts pixels from spatial to frequency domain
- Unimportant high-frequency coefficients are quantized away
This produces I-frames — each independently decodable.
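The transform-then-quantize idea can be shown on a single 8×8 block. This is a toy sketch in pure Python, not how a real encoder is written: it assumes an orthonormal DCT-II, a single uniform quantization step of 16, and a made-up smooth gradient block. Real codecs use integer transforms and per-frequency quantization tables.

```python
import math

N = 8  # block size used in this toy example


def dct_1d(values):
    """Orthonormal DCT-II of a length-N sequence."""
    out = []
    for k in range(N):
        s = sum(values[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[y][u] for y in range(N)]) for u in range(N)]
    # cols is indexed [u][v]; return it as [v][u] (vertical freq, horizontal freq)
    return [[cols[u][v] for u in range(N)] for v in range(N)]


def quantize(coeffs, step):
    """Divide by the step and round: small coefficients collapse to zero."""
    return [[round(c / step) for c in row] for row in coeffs]


# Hypothetical smooth block: a gentle brightness gradient, no fine detail.
block = [[100 + 2 * x + y for x in range(N)] for y in range(N)]

quantized = quantize(dct_2d(block), step=16)
nonzero = sum(1 for row in quantized for c in row if c != 0)

for row in quantized:
    print(row)
print(f"{nonzero} of {N * N} coefficients survive quantization")
```

For smooth content almost all of the 64 coefficients round to zero, and long runs of zeros are cheap to store. A larger step discards more detail and produces a smaller file.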
Axis 2: Inter-frame Compression
Exploit the fact that adjacent frames are nearly identical — only store differences.
Frame N (I-frame): Complete image
┌─────────────┐
│ 🚗 │
│ ___________ │ ← Stored in full
└─────────────┘
Frame N+1 (P-frame): Only the difference
"Move the car in frame N 15 pixels to the right"
─────────► A few bytes to describe
The technique: Motion Estimation + Motion Compensation:
- Divide the frame into small blocks (macroblocks, typically 16×16 or 8×8)
- For each block, search the previous frame for the best match
- Record only the motion vector (how far it moved) + residual (tiny remaining difference)
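The three steps above can be sketched as an exhaustive block-matching search. This is a toy illustration under simplifying assumptions: tiny grayscale frames stored as nested lists, a 4×4 block, a ±3 pixel search window, and sum of absolute differences (SAD) as the matching cost. Production encoders use much faster search patterns and sub-pixel precision.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
        for j in range(size)
        for i in range(size)
    )


def best_motion_vector(prev, cur, bx, by, size=4, search=3):
    """Find where the block of `cur` at (bx, by) best matches inside `prev`.

    Returns (cost, dx, dy): the residual cost and the offset into `prev`.
    """
    h, w = len(prev), len(prev[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            px, py = bx + dx, by + dy
            if 0 <= px <= w - size and 0 <= py <= h - size:
                cost = sad(cur, prev, bx, by, px, py, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best


def make_frame(x0, y0, w=16, h=8):
    """Hypothetical test frame: a bright 4x4 square on a dark background."""
    return [
        [200 if x0 <= x < x0 + 4 and y0 <= y < y0 + 4 else 10 for x in range(w)]
        for y in range(h)
    ]


prev = make_frame(3, 2)   # square at x=3
cur = make_frame(5, 2)    # same square, moved 2 pixels to the right

cost, dx, dy = best_motion_vector(prev, cur, bx=5, by=2)
print(f"motion vector = ({dx}, {dy}), residual cost = {cost}")
```

The block is found 2 pixels to the left in the previous frame with zero residual, so the encoder only needs to store the vector. The nested search loops are also a small picture of why encoding costs far more than decoding.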
Inter-frame compression is far more efficient than intra-frame compression on typical footage, which is why a video file is so much smaller than the same frames stored as a JPEG image sequence.
Additional Codec Techniques (Know They Exist, Don’t Memorize)
| Technique | What it does | Effect |
|---|---|---|
| Transform | DCT / Integer Transform: pixels → frequency coefficients | Prepares data for quantization |
| Quantization | Divide coefficients by an integer, round | Main quality/compression knob |
| Entropy Coding | CABAC / CAVLC: use fewer bits for common symbols | Lossless final squeeze |
| In-loop Filter | Remove blocking artifacts | Smoother image |
| SAO (H.265+) / ALF (H.266) | Sample adaptive offset / adaptive loop filter | Reduces ringing and other edge artifacts |
| Multi-reference | P/B frames can reference several previously decoded frames (B-frames in both directions) | Better prediction → smaller residuals |
In practice, you control these through encoder parameters — no need to implement them yourself.
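To see why entropy coding pays off after quantization, it helps to measure how few bits per symbol a skewed distribution actually needs. The sketch below computes Shannon entropy, which is the lower bound that coders such as CABAC approach; the coefficient list is a made-up example of what a quantized block might look like, not output from a real encoder.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Shannon entropy: the minimum average bits per symbol for this distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical quantized block: mostly zeros, a few small values, one large DC term.
quantized = [0] * 56 + [1] * 4 + [-1] * 2 + [3, 55]

bits = entropy_bits_per_symbol(quantized)
print(f"skewed block : {bits:.2f} bits/symbol (vs 8 bits/symbol stored naively)")

# For comparison, four equally likely symbols need the full 2 bits each.
print(f"uniform data : {entropy_bits_per_symbol([0, 1, 2, 3]):.2f} bits/symbol")
```

The more aggressively the quantizer pushes coefficients to zero, the more skewed the distribution becomes and the fewer bits the entropy coder needs.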
The Five Major Codecs
H.264 / AVC (2003): The Universal Gold Standard
- Compatibility: virtually every device that plays video supports it
- Compression: our baseline for comparison
- Patents: paid (MPEG LA pool), but industry default
- Best for: maximum compatibility, constrained compute
H.264 is 20+ years old and still the fallback codec for YouTube, Facebook, and Zoom.
H.265 / HEVC (2013): Better Compression, Licensing Nightmare
- Compression: ~37% bitrate savings vs H.264 at equivalent quality
- Patents: chaotic and expensive — three patent pools (MPEG LA, HEVC Advance, Velos Media) + many un-pooled patents
- Hardware decode: iPhone 7+ (2016), Android 6+ flagships, most 4K TVs
- Best for: Apple ecosystem, 4K streaming, bandwidth-sensitive scenarios
The licensing mess is why HEVC adoption on the web was so slow. Chrome and Firefox both resisted adding support.
VP9 (2013): Google’s Free Alternative
- By: Google (acquired On2 Technologies)
- Compression: close to H.265
- Patents: royalty-free
- Primary use: YouTube, Google Meet
- Caveat: Apple platforms have only limited VP9 support, so iOS usually needs an H.264/H.265 rendition
AV1 (2018): The Royalty-Free Next Generation
- By: Alliance for Open Media, AOMedia (Google, Netflix, Meta, Amazon, Cisco, Microsoft, Intel, Apple, and more)
- Compression: ~53% savings vs H.264, ~25% better than H.265
- Patents: royalty-free
- Hardware decode: iPhone 15 Pro+ (2023), Pixel 6+, Snapdragon 8 Gen 2+, Intel Arc, NVIDIA RTX 30+
- Encoding speed: early encoders were extremely slow; SVT-AV1 has improved this dramatically, but CPU encoding is still ~2–5× slower than H.265
Netflix, YouTube, TikTok, and Meta are all moving toward AV1 as the primary codec.
H.266 / VVC (2020): Latest Generation, Not Yet Mainstream
- Compression: up to ~78% savings vs H.264; ~25–30% better than AV1 in head-to-head tests (the two figures come from different comparisons and do not compose)
- Hardware: flagships starting in 2024; consumer coverage still low
- Status: wait and see
Comparison Table
| | H.264 | H.265 | VP9 | AV1 | H.266 |
|---|---|---|---|---|---|
| Year | 2003 | 2013 | 2013 | 2018 | 2020 |
| Savings vs H.264 | baseline | 37% | ~30% | 53% | 78% |
| Encode speed | Fastest | Mid | Mid | Slow | Slowest |
| Decode load | Lightest | Mid | Mid | Higher | Heavy |
| Compatibility | Universal | Very good | Good (web) | Growing | Low |
| Royalties | Paid | High + messy | Free | Free | Paid |
These percentages come from specific test sets and conditions (BD-rate with VMAF/PSNR). Real-world results vary significantly by content type (animation vs. live action vs. screen recording). Don’t use them as absolute marketing claims.
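One thing to keep in mind when reading such figures: savings are all stated against H.264, so they cannot simply be subtracted to compare two newer codecs. A minimal sketch of the conversion, using the H.265 and AV1 figures quoted above (the helper name `relative_savings` is made up):

```python
def relative_savings(savings_a, savings_b):
    """Bitrate saving of codec B relative to codec A.

    Both inputs are fractional savings against the same baseline (here H.264).
    """
    return 1 - (1 - savings_b) / (1 - savings_a)


h265_vs_h264 = 0.37
av1_vs_h264 = 0.53

av1_vs_h265 = relative_savings(h265_vs_h264, av1_vs_h264)
print(f"AV1 vs H.265: {av1_vs_h265:.0%} lower bitrate")

# Applying the savings to a 3.5 Mbps H.264 stream:
h264_kbps = 3500
print(f"H.265: {h264_kbps * (1 - h265_vs_h264):.0f} kbps")
print(f"AV1  : {h264_kbps * (1 - av1_vs_h264):.0f} kbps")
```

The result, roughly 25%, matches the "~25% better than H.265" figure given for AV1 and is noticeably larger than the naive 53 − 37 = 16 points.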
How to Choose a Codec in Practice
Where do your users watch?
│
├── ① Web browsers + all phones + legacy set-top boxes
│ → Must have H.264 (fallback)
│ → Add H.265 (iOS) + AV1 (Android flagships / modern Chrome) for bandwidth savings
│
├── ② Native app only (iOS + Android + optional TV)
│ → H.264 + H.265 as primary; AV1 gradual rollout by device capability
│
├── ③ Web-first, global bandwidth cost matters (YouTube/Netflix scale)
│ → AV1 primary + H.264 fallback
│
└── ④ 4K / HDR premium content
→ H.265 / AV1 + Dolby Vision
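On the playback side, "rollout by device capability" usually boils down to picking the most efficient codec the device can decode and falling back to H.264 otherwise. The sketch below is a hypothetical illustration of that rule; the codec identifiers, the preference order, and the function name are assumptions, not part of any player API.

```python
# Most efficient first; H.264 last as the universal fallback.
PREFERENCE = ["av1", "hevc", "h264"]


def pick_codec(device_decoders, available=PREFERENCE):
    """Return the first codec in preference order that the device can decode.

    `device_decoders` is the set of codecs the device reports as supported;
    `available` is the list of codecs this title was actually encoded in.
    """
    for codec in available:
        if codec in device_decoders:
            return codec
    return "h264"  # assume every device plays H.264


print(pick_codec({"h264", "hevc"}))           # e.g. an older iPhone
print(pick_codec({"h264", "vp9", "av1"}))     # e.g. a recent Android flagship
print(pick_codec({"h264"}))                   # e.g. a legacy set-top box
```

In a real service this decision is normally made through the manifest and the player's capability probing rather than a hand-written function, but the fallback logic is the same.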
A Typical Short-Form Video Encoding Ladder
| Tier | Resolution | H.264 bitrate | H.265 bitrate | AV1 bitrate |
|---|---|---|---|---|
| Low | 360p | 500 kbps | 350 kbps | 250 kbps |
| Mid | 540p | 900 kbps | 650 kbps | 480 kbps |
| Main | 720p | 1.5 Mbps | 1.0 Mbps | 750 kbps |
| High | 1080p | 3.5 Mbps | 2.2 Mbps | 1.6 Mbps |
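Bitrates translate directly into storage and delivery cost. As a rough sketch, the snippet below converts the ladder above into megabytes per minute of video (video track only, assuming 1 kbps = 1000 bits/s and decimal megabytes; the helper name is made up):

```python
# Bitrates in kbps, copied from the ladder table above.
LADDER_KBPS = {
    "360p":  {"h264": 500,  "h265": 350,  "av1": 250},
    "540p":  {"h264": 900,  "h265": 650,  "av1": 480},
    "720p":  {"h264": 1500, "h265": 1000, "av1": 750},
    "1080p": {"h264": 3500, "h265": 2200, "av1": 1600},
}


def megabytes(bitrate_kbps, seconds):
    """Size in MB of a stream at a constant average bitrate."""
    return bitrate_kbps * 1000 * seconds / 8 / 1_000_000


for tier, codecs in LADDER_KBPS.items():
    sizes = ", ".join(f"{c}: {megabytes(k, 60):5.2f} MB" for c, k in codecs.items())
    print(f"{tier:>5} per minute -> {sizes}")

# Saving of the AV1 rung relative to H.264 at 1080p.
saving_1080p = 1 - LADDER_KBPS["1080p"]["av1"] / LADDER_KBPS["1080p"]["h264"]
print(f"1080p AV1 vs H.264: {saving_1080p:.0%} fewer bytes")
```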
Hands-On: Compress a Video with ffmpeg
Basic H.264
ffmpeg -i input.mov -c:v libx264 output.mp4
CRF Quality Control
ffmpeg -i input.mov \
-c:v libx264 -preset medium -crf 23 \
-c:a aac -b:a 128k \
output.mp4
| Parameter | Meaning |
|---|---|
| -preset medium | Speed/compression trade-off. Options: ultrafast → veryslow. Slower = smaller file at same quality |
| -crf 23 | Quality target, 0–51. Lower = better quality, larger file. Default is 23. |
| -c:a aac -b:a 128k | Audio: AAC at 128 kbps |
CRF quick reference: 18 = visually lossless, 23 = high quality, 28 = acceptable (visible compression).
Stream-Ready VOD Encoding
ffmpeg -i input.mov \
-c:v libx264 -preset slow -crf 22 \
-profile:v high -level 4.0 \
-g 60 -keyint_min 60 -sc_threshold 0 \
-c:a aac -b:a 128k \
-movflags +faststart \
output.mp4
| Parameter | Why |
|---|---|
| -g 60 -keyint_min 60 | One I-frame every 60 frames. At 30 fps = 2-second GOP, aligns with segmentation. |
| -sc_threshold 0 | Disable scene-cut auto I-frame insertion. Ensures all bitrate tiers have I-frames at the same positions. |
| -movflags +faststart | Move the MP4 “table of contents” (moov box) to the start of the file, which enables progressive playback. See Part 4. |
| -profile:v high -level 4.0 | Compatibility: Level 4.0 supports up to 1080p30. |
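The value 60 is not magic: it is frame rate × segment duration. When encoding several tiers, it can be safer to derive it than to hard-code it. The following sketch only builds the argument list for the command above (it does not run ffmpeg); the function names and defaults are assumptions for illustration.

```python
import shlex


def gop_size(fps, segment_seconds):
    """Keyframe interval in frames so that every segment starts on an I-frame."""
    frames = fps * segment_seconds
    if frames != int(frames):
        raise ValueError("fps * segment_seconds must be a whole number of frames")
    return int(frames)


def x264_vod_args(src, dst, fps=30, segment_seconds=2, crf=22):
    """Argument list mirroring the stream-ready command shown above."""
    g = str(gop_size(fps, segment_seconds))
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-preset", "slow", "-crf", str(crf),
        "-profile:v", "high", "-level", "4.0",
        "-g", g, "-keyint_min", g, "-sc_threshold", "0",
        "-c:a", "aac", "-b:a", "128k",
        "-movflags", "+faststart",
        dst,
    ]


args = x264_vod_args("input.mov", "output.mp4")
print(shlex.join(args))

# A 25 fps source with the same 2-second segments needs a different interval.
print(gop_size(25, 2))
```

The list can be passed to `subprocess.run(args, check=True)` on a machine with ffmpeg installed.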
H.265 Encoding
ffmpeg -i input.mov \
-c:v libx265 -preset medium -crf 26 \
-tag:v hvc1 \
-c:a aac -b:a 128k \
-movflags +faststart \
output_hevc.mp4
Note: H.265 CRF values need to be ~3–5 higher than H.264 for equivalent visual quality (CRF 26 ≈ H.264 CRF 22). The -tag:v hvc1 option is required for Apple devices to recognize the file.
Compression Results (1-min 4K source file, ~2 GB)
| Command | Output size | Ratio |
|---|---|---|
| Source file (raw 4K YUV at 30 fps would be ~21 GB per minute) | ~2 GB | 100% |
| H.264 CRF 23 | ~30 MB | 1.5% |
| H.264 CRF 18 | ~80 MB | 4% |
| H.265 CRF 26 | ~18 MB | 0.9% |
| AV1 (SVT-AV1 preset 8) | ~12 MB | 0.6% |
Relative to the ~2 GB source, that is roughly 25–170× smaller, and the result is virtually indistinguishable on a phone screen.
Per-Title and Per-Shot Encoding
The default approach is a fixed bitrate ladder for all content. But:
- A cartoon (flat colors, little detail) looks great at 1080p@1000k
- A concert (flashing lights, fast motion) still shows compression at 1080p@5000k
Per-Title Encoding (Netflix, 2015): calculate the optimal bitrate ladder for each title based on its visual complexity.
Traditional: Same ladder for every movie
360p@500k / 720p@1500k / 1080p@4000k
Per-Title: Custom ladder per movie
Cartoon: 360p@300k / 720p@800k / 1080p@1800k (saves money)
Concert: 360p@700k / 720p@2200k / 1080p@5500k (needs more bits)
Per-Shot Encoding (Netflix, 2018) goes further: split a movie by scene cuts, then optimize each shot independently using VMAF as the quality target. The reported gain is about 17% additional bitrate savings at equal quality.
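The core of per-title selection can be sketched in a few lines: run trial encodes at several bitrates, measure quality, and keep the cheapest bitrate that reaches the target. The numbers below are invented purely to illustrate the mechanics (they are not measurements), and the function is a simplification of what production systems do across resolutions and codecs.

```python
def cheapest_bitrate(trials, target_vmaf):
    """Lowest bitrate (kbps) whose measured VMAF meets the target, or None.

    `trials` is a list of (bitrate_kbps, vmaf) pairs from trial encodes.
    """
    passing = [kbps for kbps, vmaf in trials if vmaf >= target_vmaf]
    return min(passing) if passing else None


# Hypothetical 1080p trial encodes: (bitrate in kbps, VMAF score).
cartoon = [(1000, 91.0), (1800, 95.5), (4000, 98.0)]
concert = [(1800, 78.0), (4000, 89.0), (5500, 93.5)]

TARGET = 93
print("cartoon 1080p rung:", cheapest_bitrate(cartoon, TARGET), "kbps")
print("concert 1080p rung:", cheapest_bitrate(concert, TARGET), "kbps")
```

With these made-up inputs the cartoon reaches the target at 1800 kbps while the concert needs 5500 kbps, which is the shape of the custom ladders shown above.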
For most platforms, cloud providers’ “smart transcoding” templates (AWS MediaConvert QVBR, Alibaba Cloud “Narrowband HD”) deliver most of the benefit without building this yourself.
Hardware vs. Software Encoding
| Approach | Implementation | Speed | Quality | Best for |
|---|---|---|---|---|
| Software (CPU) | libx264 / libx265 / SVT-AV1 | Slow | Best | VOD offline transcoding |
| Hardware (GPU/ASIC) | NVIDIA NVENC, Intel QSV, Apple VideoToolbox | 5–50× faster | Slightly lower | Live streaming, real-time |
VOD should use software encoding: you encode once, but the bandwidth savings apply to every playback. Live streaming usually relies on hardware encoding (or very fast software presets), because the encoder has to keep up with real time: you can’t spend 10 seconds encoding 1 second of video.
VMAF, PSNR, SSIM: Measuring Visual Quality
| Metric | Full name | Method | Range | Correlation with human perception |
|---|---|---|---|---|
| PSNR | Peak Signal-to-Noise Ratio | Pixel-level difference | 0–∞ dB (higher = better) | Weak |
| SSIM | Structural Similarity | Luminance/contrast/structure | 0–1 (higher = better) | Medium |
| VMAF | Video Multi-Method Assessment Fusion | ML fusion of multiple features | 0–100 (higher = better) | Strong |
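PSNR is simple enough to compute by hand, which also shows why it correlates weakly with perception: it only looks at per-pixel error. A minimal sketch for 8-bit grayscale frames stored as nested lists (the sample frames are made up):

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    pixels = [
        (a, b)
        for ref_row, dist_row in zip(reference, distorted)
        for a, b in zip(ref_row, dist_row)
    ]
    mse = sum((a - b) ** 2 for a, b in pixels) / len(pixels)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)


reference = [[100] * 4 for _ in range(4)]
off_by_one = [[101] * 4 for _ in range(4)]   # every pixel off by 1
off_by_ten = [[110] * 4 for _ in range(4)]   # every pixel off by 10

print(f"error of 1 : {psnr(reference, off_by_one):.2f} dB")
print(f"error of 10: {psnr(reference, off_by_ten):.2f} dB")
print(f"identical  : {psnr(reference, reference)}")
```

A uniform brightness shift and a patch of blocking artifacts can yield the same PSNR even though viewers judge them very differently, which is the gap SSIM and VMAF try to close.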
VMAF (open-sourced by Netflix) is the industry standard for perceptual quality:
- VMAF ≥ 93: Visually lossless
- VMAF ≈ 80: High quality
- VMAF ≈ 60: Acceptable
- VMAF ≤ 40: Obvious compression artifacts
Key Takeaways
- Video compression achieves <1% of raw size through intra-frame (compress each image) + inter-frame (store only differences) compression.
- Encoding is slow, decoding is fast. Encoder + Decoder = Codec.
- H.264 is the universal fallback. H.265 saves 37% but has messy licensing. AV1 is free and saves 53%. VVC is the future but not ready yet.
- VOD transcoding essentials: CRF quality control, GOP alignment, faststart, disable scene-cut.
- Per-title / per-shot encoding is an advanced optimization; cloud “smart transcoding” covers most gains.
- VMAF is the industry-standard quality metric.
Previous: Part 1: Video Fundamentals