VOD Deep Dive Part 1: Video Fundamentals — What Is a Video, Really?
The first installment of our 12-part VOD streaming series. Learn what video actually is at the byte level — pixels, resolution, frame rates, bitrate, I/P/B frames, GOP, color spaces, and HDR.
This is Part 1 of the VOD Streaming Deep Dive series — a 12-part technical guide covering everything from raw pixels to global-scale delivery.
Questions You’ve Probably Never Thought About
You use video every day, but have you ever wondered:
- What actually happens between tapping “play” and seeing the first frame?
- Why does a 2-hour Netflix movie weigh only 2 GB, while 2 hours of raw iPhone footage is 20 GB?
- Why does video “get blurry” on a weak connection instead of just freezing?
- Why can iPhones only play HLS but not DASH natively?
- Why can’t you copy a downloaded Netflix movie to someone else’s phone?
- How do short-form video apps achieve near-instant playback when you swipe?
By the end of this series, you’ll be able to answer every one of these.
What Is Video on Demand (VOD)?
Video on Demand — VOD — means exactly what it says: the user watches whatever they want, whenever they want. The video file was recorded and stored on a server long before playback.
The counterpart is live streaming:
| | VOD | Live Streaming |
|---|---|---|
| Content source | Pre-recorded files | Camera/encoder producing in real time |
| Seekable? | Yes | No (or limited DVR window) |
| Examples | Netflix, YouTube, Bilibili, online courses | Sports broadcasts, e-commerce live, game streaming |
| Engineering challenge | Deliver to the most users at the lowest cost | Keep latency low, encode in real time |
Short-form video (TikTok, YouTube Shorts) is also VOD. Although it feels real-time, every clip is a pre-uploaded recording. Clips are split into segments a few seconds long, and a recommendation algorithm decides what comes next; that’s what makes the feed feel endless.
This series focuses on VOD, but most of the technology (codecs, containers, protocols, DRM) applies to live streaming too.
The VOD Journey: From Camera to Your Screen
A video goes through six stages to reach your phone:
①Capture/Upload ②Transcode ③Package
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Director │ ────► │ Compress │ ──────► │ Cut into │
│ uploads │ │ into many│ │ small │
│ raw file │ │ qualities│ │ segments │
└─────────┘ └─────────┘ └─────────┘
│
┌───────────────────────────────────────┘
│
▼
④Store in Cloud ⑤CDN Distribution ⑥Playback
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Put into │ ────► │ Copy to │ ──────► │ Auto- │
│ object │ │ nearest │ │ select │
│ storage │ │ data │ │ quality │
│ (S3 etc) │ │ center │ │ & play │
└─────────┘ └─────────┘ └─────────┘
Each stage maps to a chapter in this series:
| Stage | Problem it solves | Series part |
|---|---|---|
| ①Capture/Upload | How to reliably send large files to the server | Part 11 |
| ②Transcode | How to compress a 20 GB master down to 200 MB and still look great | Part 2, Part 3 |
| ③Package | How to combine video + audio + subtitles and slice into segments | Part 4, Part 5 |
| ④Storage | How to store massive amounts of video cheaply | Part 11, Part 12 |
| ⑤CDN | How to make it fast for users worldwide | Part 7 |
| ⑥Playback | How to adapt to network speed and prevent piracy | Part 6, Part 8, Part 9 |
And the thread running through everything — how do you know if users are having a good experience? — is Part 10: QoE Metrics.
A Video Is Just a Stack of Photos
This is the single most important sentence in this chapter:
A video = a sequence of images played rapidly + an audio track.
When you watch a video, your brain sees:
Frame 1 Frame 2 Frame 3 Frame 4 Frame 5 ...
┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
│ │ │ │ │ │ │ │ │ │
│ 🚗 │ │ 🚗 │ │ 🚗 │ │ 🚗 │ │ 🚗 │
│ │ │ │ │ │ │ │ │ │
└────┘ └────┘ └────┘ └────┘ └────┘
(car shifts slightly right)
┃
▼ Play 30 images per second → you see "smooth driving"
Each image is called a frame.
Pixels and Resolution
Zoom into any image far enough and you’ll see tiny squares — each one records a single color. That square is a pixel.
- A 1920×1080 image has 1920 columns × 1080 rows = 2,073,600 pixels (~2 megapixels).
- Each pixel stores a color value that takes a few bytes.
Resolution is just the pixel dimensions. The common labels:
| Label | Resolution | Total pixels | Pixels vs 240p |
|---|---|---|---|
| 240p | 426 × 240 | ~100K | 1x (baseline) |
| 360p | 640 × 360 | ~230K | 2.3x |
| 480p (SD) | 854 × 480 | ~410K | 4.0x |
| 720p (HD) | 1280 × 720 | ~920K | 9.0x |
| 1080p (FHD) | 1920 × 1080 | ~2.1M | 20x |
| 1440p (QHD, often sold as “2K”) | 2560 × 1440 | ~3.7M | 36x |
| 2160p (4K UHD) | 3840 × 2160 | ~8.3M | 81x |
| 4320p (8K) | 7680 × 4320 | ~33.2M | 325x |
Note: “4K” has two flavors — UHD 4K (consumer: 3840×2160) and DCI 4K (cinema: 4096×2160).
Portrait mobile video uses a 9:16 ratio (e.g., 720×1280), the inverse of landscape 16:9 (1920×1080).
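If you want to check the pixel math yourself, here is a minimal Python sketch. The `RESOLUTIONS` dictionary and helper names are made up for this example; the ratios are computed from exact pixel counts, so they can differ slightly from figures derived from rounded totals.

```python
# Pixel counts for the common 16:9 labels, relative to 240p.
# RESOLUTIONS and the helper names are illustrative, not from any library.
RESOLUTIONS = {
    "240p": (426, 240),
    "360p": (640, 360),
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "2160p": (3840, 2160),
    "4320p": (7680, 4320),
}


def pixel_count(width, height):
    """Total pixels in one frame."""
    return width * height


def relative_to(base_label, label):
    """How many times more pixels `label` has than `base_label`."""
    return pixel_count(*RESOLUTIONS[label]) / pixel_count(*RESOLUTIONS[base_label])


if __name__ == "__main__":
    for label, (w, h) in RESOLUTIONS.items():
        ratio = relative_to("240p", label)
        print(f"{label:>6}: {w}x{h} = {pixel_count(w, h):>10,} px ({ratio:.1f}x)")
```

The takeaway: each step up the ladder roughly doubles the pixel count, and everything downstream (encoding time, bitrate, storage) scales with it.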
How Pixels Store Color: RGB, YUV, and Bit Depth
RGB
The most intuitive method: store Red, Green, Blue intensity per pixel.
- Black = R:0 G:0 B:0
- White = R:255 G:255 B:255
- Each channel uses 8 bits (1 byte, 0–255), so one RGB pixel = 3 bytes.
Quick math: a single 1080p RGB frame = 1920 × 1080 × 3 bytes ≈ 6.2 MB. At 30 fps, that’s 186 MB/sec — a 2-hour movie would be 1.3 TB uncompressed!
That’s why video must be compressed.
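The quick math above is easy to reproduce. This sketch assumes 8-bit RGB (3 bytes per pixel) and decimal units (1 MB = 1,000,000 bytes); the function names are just for illustration.

```python
# Uncompressed video size, assuming 8-bit RGB (3 bytes per pixel).
def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Bytes needed for one uncompressed frame."""
    return width * height * bytes_per_pixel


def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Bytes needed for an uncompressed clip of the given length."""
    return raw_frame_bytes(width, height, bytes_per_pixel) * fps * seconds


frame = raw_frame_bytes(1920, 1080)                 # one 1080p frame
per_second = frame * 30                             # one second at 30 fps
movie = raw_video_bytes(1920, 1080, 30, 2 * 3600)   # a 2-hour movie

if __name__ == "__main__":
    print(f"frame:  {frame / 1e6:.1f} MB")
    print(f"second: {per_second / 1e6:.1f} MB")
    print(f"movie:  {movie / 1e12:.2f} TB")
```

Swap in 3840 × 2160 and 60 fps to see why nobody ships raw 4K over the internet.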
YUV (The Video Industry Standard)
Video uses YUV (also written YCbCr):
- Y (Luma): How bright the pixel is (in full-range 8-bit, 0 = black and 255 = white; broadcast-style “limited range” video uses 16–235)
- U, V (Chroma): What color the pixel is
Why not just use RGB? Because:
The human eye is far more sensitive to brightness than to color.
YUV exploits this: you can record less color information with virtually no perceived difference.
Chroma Subsampling
| Scheme | Description | Data vs 4:4:4 | Used in |
|---|---|---|---|
| 4:4:4 | Full Y/U/V per pixel | 100% | Film post-production |
| 4:2:2 | Two adjacent pixels share one U/V pair | 67% | Broadcast, professional |
| 4:2:0 | Four adjacent pixels share one U/V pair | 50% | Nearly all consumer streaming |
Luma Y (all kept) Chroma U/V (one per 2×2 block)
┌──┬──┬──┬──┐ ┌─────┬─────┐
│Y │Y │Y │Y │ │ │ │
├──┼──┼──┼──┤ │ UV │ UV │
│Y │Y │Y │Y │ │ │ │
├──┼──┼──┼──┤ ├─────┼─────┤
│Y │Y │Y │Y │ │ │ │
├──┼──┼──┼──┤ │ UV │ UV │
│Y │Y │Y │Y │ │ │ │
└──┴──┴──┴──┘ └─────┴─────┘
16 Y values 4 UV pairs
RGB 4:4:4 = 16 × 3 = 48 bytes
YUV 4:2:0 = 16 + 4 + 4 = 24 bytes (half the data)
The trade-off: sharp red text on a pure black background may show slight color bleeding. In the vast majority of natural scenes, though, viewers can’t tell the difference.
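The byte counts in the diagram can be sketched in a few lines of Python. Two assumptions to flag: the frame has even dimensions and 8-bit samples, and `subsample_420` averages each 2×2 block, which is one common way to downsample chroma but not the only one.

```python
# Chroma subsampling arithmetic for 8-bit video with even dimensions.
# How many luma samples share one U and one V sample in each scheme:
PIXELS_PER_CHROMA_SAMPLE = {"4:4:4": 1, "4:2:2": 2, "4:2:0": 4}


def yuv_frame_bytes(width, height, scheme):
    """Bytes per frame: one Y per pixel plus the shared U and V samples."""
    luma = width * height
    chroma_each = luma // PIXELS_PER_CHROMA_SAMPLE[scheme]
    return luma + 2 * chroma_each


def subsample_420(plane):
    """Shrink a chroma plane by averaging every 2x2 block (an assumed filter)."""
    rows, cols = len(plane), len(plane[0])
    return [
        [
            (plane[r][c] + plane[r][c + 1]
             + plane[r + 1][c] + plane[r + 1][c + 1]) // 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


if __name__ == "__main__":
    for scheme in PIXELS_PER_CHROMA_SAMPLE:
        print(scheme, yuv_frame_bytes(4, 4, scheme), "bytes for a 4x4 block")
    print(subsample_420([[10, 20], [30, 40]]))
```

For the 4×4 block in the diagram this gives 48, 32, and 24 bytes, matching the 100% / 67% / 50% column in the table.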
Bit Depth
How many bits per channel:
| Bit depth | Range per channel | Colors per pixel | Used in |
|---|---|---|---|
| 8-bit | 0–255 | 16.7M | Most consumer streaming |
| 10-bit | 0–1023 | 1.07B | Required for HDR; Netflix 4K, Ultra HD Blu-ray |
| 12-bit | 0–4095 | 68.7B | Film masters, Dolby Vision |
8-bit is usually fine, but on smooth gradients (e.g., a blue sky transitioning from deep to light blue), you can get visible banding: unnatural step-like boundaries. HDR stretches the brightness range even further, so it needs at least 10-bit to keep those steps too small to see.
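To get a feel for banding, the rough sketch below counts how many distinct code values are available for a subtle gradient. The setup is an assumption for illustration: a sky that spans only 40% to 60% of the brightness range, stretched across 3840 pixels, with simple rounding as the quantizer.

```python
# Levels and colors per bit depth, plus a toy banding experiment.
def levels(bits):
    """Number of values one channel can take."""
    return 2 ** bits


def colors(bits):
    """Number of distinct three-channel combinations."""
    return levels(bits) ** 3


def distinct_codes(bits, start=0.40, end=0.60, samples=3840):
    """Count the code values used by a gradient from `start` to `end`.

    Fewer codes across the same width means wider, more visible bands.
    """
    max_code = levels(bits) - 1
    codes = set()
    for i in range(samples):
        value = start + (end - start) * i / (samples - 1)
        codes.add(round(value * max_code))
    return len(codes)


if __name__ == "__main__":
    for bits in (8, 10, 12):
        print(f"{bits}-bit: {levels(bits)} levels, {colors(bits):,} colors, "
              f"{distinct_codes(bits)} steps in the sample gradient")
```

Going from 8-bit to 10-bit gives roughly four times as many steps for the same gradient, so each band is about a quarter as wide.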
Frame Rate (fps)
fps = frames per second.
- 24 fps: Cinema standard since the 1920s. Gives that “film look.”
- 25 / 50 fps: PAL television (Europe, China).
- 29.97 / 30 fps: NTSC (North America, Japan). Default for most phone recordings.
- 60 fps: Gaming, sports, YouTube high-frame-rate.
- 120 / 240 fps: Slow motion, professional capture.
Why is 24 fps enough for movies? Human “persistence of vision” kicks in around 16 fps — your brain already sees continuous motion. 24 fps was the 1920s sweet spot of “smooth enough + saves the most film stock.” But for fast action (sports, gaming), 60+ fps is needed to avoid motion blur.
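Watch out for 29.97 fps: it’s not a typo. When NTSC added color, the frame rate was lowered by 0.1% (to exactly 30000/1001 fps) so the new color subcarrier wouldn’t interfere with the audio carrier, while staying compatible with existing black-and-white sets.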
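Higher frame rate = larger file, but not proportionally: at 60 fps, consecutive frames are more alike, so each extra frame is cheaper to encode. As a rule of thumb, 60 fps is roughly 1.7× the size of 30 fps at the same resolution and quality.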
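Tools usually report NTSC rates as the exact fraction 30000/1001 rather than the rounded 29.97. A small sketch with Python’s `fractions` module shows why the distinction matters over long durations; the variable and function names are illustrative.

```python
from fractions import Fraction

# NTSC rates are exact fractions, not the rounded decimals people quote.
NTSC_30 = Fraction(30000, 1001)   # "29.97 fps"
NTSC_24 = Fraction(24000, 1001)   # "23.976 fps"


def frames_in(seconds, fps):
    """Exact number of frames shown in `seconds` at rate `fps`."""
    return Fraction(fps) * seconds


def drift_seconds(seconds, nominal_fps, actual_fps):
    """How far a clock that assumes `nominal_fps` drifts from real time."""
    missing_frames = frames_in(seconds, nominal_fps) - frames_in(seconds, actual_fps)
    return float(missing_frames / nominal_fps)


if __name__ == "__main__":
    print(float(NTSC_30))                      # 29.97002997...
    print(float(frames_in(3600, NTSC_30)))     # frames in one hour
    print(drift_seconds(3600, 30, NTSC_30))    # about 3.6 seconds per hour
```

That drift of about 3.6 seconds per hour is why treating 29.97 as 30 eventually breaks audio sync and timestamps.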
Bitrate: How Much Data Per Second
Bitrate is the number of bits consumed per second of video.
- kbps (kilobits/sec): 1 Mbps = 1000 kbps
- Mbps (megabits/sec): the common unit
File size ≈ bitrate × duration:
1 Mbps × 60 seconds ÷ 8 (bits to bytes) ≈ 7.5 MB
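The same formula as a tiny helper. It assumes decimal units (1 Mbps = 1,000,000 bits per second, 1 MB = 1,000,000 bytes) and ignores audio and container overhead.

```python
# File size from bitrate: bits per second x seconds / 8 bits per byte.
def file_size_mb(bitrate_mbps, seconds):
    """Approximate size in megabytes for a stream at a constant average bitrate."""
    return bitrate_mbps * seconds / 8


if __name__ == "__main__":
    print(file_size_mb(1, 60))           # 1 Mbps for one minute
    print(file_size_mb(4.5, 2 * 3600))   # 4.5 Mbps for a 2-hour movie
```

A 2-hour movie at 4.5 Mbps comes out at about 4 GB, which is the right order of magnitude for a 1080p stream.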
Typical Bitrates (H.264)
| Resolution | Recommended bitrate | 1 min file size |
|---|---|---|
| 240p | 0.3–0.5 Mbps | ~3 MB |
| 360p | 0.5–0.8 Mbps | ~5 MB |
| 480p | 0.8–1.2 Mbps | ~8 MB |
| 720p | 1.5–3 Mbps | ~15 MB |
| 1080p | 3–6 Mbps | ~30 MB |
| 4K | 15–30 Mbps | ~150 MB |
CBR / VBR / CRF
Three rate-control modes:
| Mode | Meaning | Analogy |
|---|---|---|
| CBR (Constant Bitrate) | Fixed bits per second | Always ordering exactly 2 dishes |
| VBR (Variable Bitrate) | More bits for complex scenes, fewer for simple ones | Big eater orders more, light eater orders less |
| CRF (Constant Rate Factor) | Quality stays constant, bitrate adapts | No matter what you order, you eat until 80% full |
VOD favors VBR or CRF — better quality at the same file size. Live streaming favors CBR — predictable bitrate for stable network transmission.
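A toy model can make the three modes concrete. Everything here is a simplification made up for illustration: each second of video gets a single “complexity” score, and bits per unit of complexity stands in for quality. Real encoders use far more sophisticated rate control.

```python
# Toy rate-control model. Complexity scores and the quality proxy are
# hypothetical; this is not how any real encoder allocates bits.
def allocate_cbr(complexities, target_kbps):
    """CBR: every second gets the same bits, whatever is on screen."""
    return [target_kbps for _ in complexities]


def allocate_vbr(complexities, target_kbps):
    """VBR: same total budget as CBR, shared out in proportion to complexity."""
    total_budget = target_kbps * len(complexities)
    total_complexity = sum(complexities)
    return [total_budget * c / total_complexity for c in complexities]


def allocate_crf(complexities, kbits_per_complexity_unit):
    """CRF: fix the quality level; the total size is whatever it turns out to be."""
    return [c * kbits_per_complexity_unit for c in complexities]


if __name__ == "__main__":
    scene = [1, 1, 4, 2]  # talking head, talking head, explosion, pan
    print("CBR:", allocate_cbr(scene, 3000))
    print("VBR:", allocate_vbr(scene, 3000))
    print("CRF:", allocate_crf(scene, 1200))
```

In this model CBR starves the explosion and overfeeds the talking head, VBR spends the same total more wisely, and CRF doesn’t promise any total at all.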
I-Frames, P-Frames, B-Frames: The Core of Video Compression
This is the most critical concept in this chapter. Understand it, and everything in the codec chapter falls into place.
Why Can Video Be Compressed So Aggressively?
Imagine a video of someone sitting on a couch watching TV:
Frame 1: Person on couch, TV playing animation
Frame 2: Person on couch, TV playing animation (TV image changes slightly)
Frame 3: Person blinks, TV playing animation
Frame 4: Person on couch, TV playing animation
In a scene like this, roughly 99% of pixels are identical between adjacent frames. Storing every frame in full is a massive waste.
The smart approach:
- Occasionally store a “complete snapshot”
- The rest of the time, only store “what changed since the last frame”
Three Frame Types
| Type | Full name | Content | Size | Can decode independently? |
|---|---|---|---|---|
| I-frame (keyframe) | Intra-coded | A complete image (like a JPEG) | Large | Yes |
| P-frame | Predicted | “Difference from a previous frame” | Small | No; needs the reference frame first |
| B-frame | Bidirectionally predicted | “Difference from both previous and next frames” | Smallest | No; needs both reference frames |
Timeline →
I - P - P - P - B - P - P - B - I - P - P - P ...
▲ ▲
Keyframe Next keyframe
(appears every N frames)
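Here is a deliberately simplified sketch of the “snapshot plus changes” idea. It treats a frame as a flat list of pixel values and stores per-pixel differences; real codecs predict blocks with motion compensation and also use B-frames, which this toy leaves out. The `encode` and `decode` names are illustrative.

```python
# Toy I/P encoder: full snapshot every `gop_size` frames, diffs in between.
def encode(frames, gop_size=4):
    """Return a list of ("I", pixels) and ("P", {position: new_value}) entries."""
    encoded = []
    previous = None
    for i, frame in enumerate(frames):
        if i % gop_size == 0:
            encoded.append(("I", list(frame)))      # complete image
        else:
            changes = {
                pos: new
                for pos, (old, new) in enumerate(zip(previous, frame))
                if old != new
            }
            encoded.append(("P", changes))          # only what changed
        previous = frame
    return encoded


def decode(encoded):
    """Rebuild every frame by applying each diff to the frame before it."""
    frames = []
    current = None
    for kind, payload in encoded:
        if kind == "I":
            current = list(payload)
        else:
            current = list(current)
            for pos, value in payload.items():
                current[pos] = value
        frames.append(current)
    return frames


if __name__ == "__main__":
    clip = [[5, 5, 5, 5], [5, 5, 6, 5], [5, 5, 6, 5], [5, 7, 6, 5], [5, 7, 6, 9]]
    for kind, payload in encode(clip):
        print(kind, payload)
    assert decode(encode(clip)) == clip
```

Notice that a P-frame is useless on its own: lose the I-frame before it and there is nothing to apply the changes to.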
IDR Frames
An IDR frame (Instantaneous Decoder Refresh) is a special I-frame: no frame after it may reference anything before it. That makes IDR frames “safe start points.” When you seek to the middle of a video, the player typically jumps back to the nearest preceding IDR frame and starts decoding from there.
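Seeking can be sketched as a lookup in a sorted list of IDR timestamps. This assumes the player already knows where the IDR frames are (for example from an index in the container); `seek_start` is a hypothetical helper, not a real player API.

```python
from bisect import bisect_right

# Given sorted IDR timestamps (in seconds), find where decoding must start
# so that the frame at `target` can be reconstructed.
def seek_start(idr_times, target):
    """Return the latest IDR timestamp that is <= target (or the first one)."""
    index = bisect_right(idr_times, target) - 1
    return idr_times[max(index, 0)]


if __name__ == "__main__":
    idr_times = [0.0, 2.0, 4.0, 6.0]   # one IDR every 2 seconds
    print(seek_start(idr_times, 5.3))  # decoding starts at 4.0
    print(seek_start(idr_times, 4.0))  # lands exactly on an IDR
```

The gap between the IDR and the requested time is work the decoder has to do before it can show anything, which is one reason keyframe spacing affects how snappy seeking feels.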
GOP (Group of Pictures)
A GOP starts at an I-frame and contains every frame up to, but not including, the next I-frame:
┌──── GOP 1 ────┐ ┌──── GOP 2 ────┐ ┌──── GOP 3 ...
I P B P P B I P B P P B I P ...
▲
New IDR starts here
GOP length determines segmentation granularity, because each segment has to start on a keyframe to be decodable on its own:
- Short GOP (1–2 sec): Fine segments, fast seeking and startup; slightly larger files (more I-frames)
- Long GOP (4–10 sec): Smaller files, but slower seeking
Short-form video apps typically use short GOPs (1–2s) because users swipe frequently between clips. Feature-length VOD can use longer GOPs to save bandwidth.
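The size trade-off can be estimated with back-of-the-envelope arithmetic. The frame sizes below (100 KB for an I-frame, 10 KB for everything else) are hypothetical round numbers chosen only to show the trend.

```python
# GOP length in frames, and the cost of keyframes at different GOP lengths.
def gop_frames(fps, gop_seconds):
    """Number of frames in one GOP."""
    return round(fps * gop_seconds)


def average_frame_kb(fps, gop_seconds, i_frame_kb, other_frame_kb):
    """Average frame size when each GOP holds one I-frame plus smaller frames."""
    n = gop_frames(fps, gop_seconds)
    return (i_frame_kb + (n - 1) * other_frame_kb) / n


if __name__ == "__main__":
    for seconds in (1, 2, 4, 10):
        avg = average_frame_kb(30, seconds, i_frame_kb=100, other_frame_kb=10)
        print(f"{seconds:>2}s GOP = {gop_frames(30, seconds):>3} frames, "
              f"average {avg:.2f} KB/frame")
```

With these made-up sizes, going from a 1-second to a 4-second GOP shaves the average frame from 13 KB to under 11 KB, and the savings flatten out as the GOP gets longer.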
Color Spaces and HDR
Color Spaces
The same numeric RGB values display different actual colors under different standards:
| Standard | Used in | Gamut size |
|---|---|---|
| sRGB | Web, computers | Baseline |
| BT.709 | HDTV, 1080p streaming | ≈ sRGB |
| BT.2020 | HDR, 4K/8K | ~72% larger than BT.709 |
| DCI-P3 | Cinema, Apple ecosystem | Between BT.709 and BT.2020 |
HDR: Brighter Brights, Darker Darks, More Colors
Traditional SDR peaks at ~100 nits. HDR reaches 1,000–4,000 nits peak brightness, combined with 10-bit depth + BT.2020 gamut:
- Stars in a night sky appear brighter
- Shadow details are preserved
- Colors are more saturated without clipping
Major HDR formats:
| Format | By | Key feature |
|---|---|---|
| HDR10 | Consumer Technology Association (CTA) | Open and royalty-free; 10-bit; static metadata for the whole title |
| HDR10+ | Samsung / Amazon | Dynamic metadata per scene |
| Dolby Vision | Dolby | Up to 12-bit, dynamic metadata; proprietary, license required |
| HLG | BBC / NHK | Compatible with SDR displays; preferred for broadcast |
Be aware: HDR video on an SDR display won’t magically look better. Without tone mapping, it looks washed out and gray.
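To show what a tone-mapping operator does at its simplest, the sketch below compares hard clipping with the extended Reinhard curve, a classic textbook operator. This is an illustration under heavy assumptions: it works on linear luminance in nits, ignores color and transfer functions entirely, and is not what any particular player or TV implements.

```python
# Squeezing HDR luminance (in nits) into an SDR range of 0.0 to 1.0.
SDR_PEAK_NITS = 100.0


def clip_to_sdr(nits):
    """Naive approach: everything brighter than SDR peak becomes pure white."""
    return min(nits, SDR_PEAK_NITS) / SDR_PEAK_NITS


def reinhard_extended(nits, hdr_peak_nits=1000.0):
    """Extended Reinhard curve: compresses highlights so the HDR peak maps to 1.0."""
    l_in = nits / SDR_PEAK_NITS
    l_white = hdr_peak_nits / SDR_PEAK_NITS
    return l_in * (1 + l_in / (l_white ** 2)) / (1 + l_in)


if __name__ == "__main__":
    for nits in (10, 100, 400, 1000):
        print(f"{nits:>5} nits  clip={clip_to_sdr(nits):.3f}  "
              f"reinhard={reinhard_extended(nits):.3f}")
```

With clipping, a 400-nit cloud and a 1,000-nit sun both come out as the same flat white; the curve keeps them distinguishable at the cost of darkening the midtones.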
Hands-On: Inspect a Video with ffprobe
# macOS (Homebrew); on Debian/Ubuntu use: sudo apt install ffmpeg
brew install ffmpeg
# Inspect a video
ffprobe -v error -show_streams -select_streams v:0 myvideo.mp4
Typical output:
codec_name=h264 # Codec (H.264) — see Part 2
profile=High # Encoding profile
width=1920
height=1080 # Resolution: 1080p
r_frame_rate=30000/1001 # Frame rate: 29.97 fps
pix_fmt=yuv420p # Pixel format: YUV 4:2:0, 8-bit
color_space=bt709 # Color matrix: BT.709 (typical for SDR HD)
bit_rate=4500000 # Bitrate: 4.5 Mbps
After reading this chapter, you should understand every line.
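If you’d rather have the numbers in a script, the `key=value` lines are easy to parse. The sketch below works on a pasted sample (without the explanatory comments) rather than calling ffprobe, so it runs anywhere; `parse_ffprobe_fields` and `summarize` are illustrative names, and the sample text is the typical output from above.

```python
from fractions import Fraction

# Sample ffprobe output, pasted in as text so the example needs no video file.
SAMPLE_OUTPUT = """\
[STREAM]
codec_name=h264
profile=High
width=1920
height=1080
r_frame_rate=30000/1001
pix_fmt=yuv420p
color_space=bt709
bit_rate=4500000
[/STREAM]
"""


def parse_ffprobe_fields(text):
    """Turn key=value lines into a dict, skipping section markers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line or line.startswith("["):
            continue
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


def summarize(fields):
    """Convert the raw strings into the units used in this chapter."""
    fps = Fraction(fields["r_frame_rate"])  # e.g. "30000/1001"
    return {
        "codec": fields["codec_name"],
        "resolution": f'{fields["width"]}x{fields["height"]}',
        "fps": round(float(fps), 2),
        "bitrate_mbps": int(fields["bit_rate"]) / 1_000_000,
    }


if __name__ == "__main__":
    print(summarize(parse_ffprobe_fields(SAMPLE_OUTPUT)))
```

To feed it real data, you could capture the output of the ffprobe command above with Python’s `subprocess.run(..., capture_output=True, text=True)` and pass the resulting text to `parse_ffprobe_fields`.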
Key Takeaways
- Video = a sequence of images + audio. Each image is a frame.
- Each frame is made of pixels; resolution is the pixel dimensions.
- The video world uses YUV 4:2:0 (half the data of RGB, imperceptible difference).
- Bit depth: 8-bit is standard; HDR requires 10-bit.
- Frame rate: 24 fps (cinema) / 30 fps (TV) / 60 fps (gaming/sports).
- Bitrate = data per second. VBR/CRF is preferred for VOD.
- I/P/B frames are how video achieves 50–100× compression.
- GOP = the group between keyframes. Short-form video uses short GOPs (1–2s).
- HDR = 10-bit + wider gamut + higher brightness — fundamentally different from SDR.
Three Pairs of Concepts You’ll See Everywhere
Before diving deeper, pin these down:
- Codec ≠ Container: H.264 is a compression algorithm (codec); MP4 is a file format (container). An .mp4 file can hold H.264, H.265, or AV1.
- Protocol ≠ Packaging: HLS and DASH are “how to deliver” rules (protocols); fMP4 and TS are “how to slice and wrap” formats (packaging).
- Encryption ≠ DRM: HLS AES-128 is lightweight encryption (key leaks = game over). DRM is an entire system: key distribution + device restrictions + output protection.
All three pairs are covered in detail throughout this series.
Next up: How does a 4K movie fit in 5 GB? → Part 2: Video Codecs — H.264, H.265, and AV1