VOD Deep Dive Part 1: Video Fundamentals — What Is a Video, Really?

This is Part 1 of the VOD Streaming Deep Dive series — a 12-part technical guide covering everything from raw pixels to global-scale delivery.

Questions You’ve Probably Never Thought About

You use video every day, but have you ever wondered:

What actually happens between tapping “play” and seeing the first frame?
Why does a 2-hour Netflix movie weigh only 2 GB, while 2 hours of footage straight off an iPhone can run to 20 GB?
Why does video “get blurry” on a weak connection instead of just freezing?
Why can iPhones only play HLS but not DASH natively?
Why can’t you copy a downloaded Netflix movie to someone else’s phone?
How do short-form video apps achieve near-instant playback when you swipe?

By the end of this series, you’ll be able to answer every one of these.

What Is Video on Demand (VOD)?

Video on Demand — VOD — means exactly what it says: the user watches whatever they want, whenever they want. The video file was recorded and stored on a server long before playback.

The counterpart is live streaming:

	VOD	Live Streaming
Content source	Pre-recorded files	Camera/encoder producing in real time
Seekable?	Yes	No (or limited DVR window)
Examples	Netflix, YouTube, Bilibili, online courses	Sports broadcasts, live shopping, game streaming
Engineering challenge	Deliver to the most users at the lowest cost	Keep latency low, encode in real time

Short-form video (TikTok, YouTube Shorts) is also VOD. Although it feels real-time, every clip is a pre-uploaded recording. Each clip is cut into segments a few seconds long, and a recommendation algorithm decides which clip to serve next; that is what makes the feed feel endless.

This series focuses on VOD, but most of the technology (codecs, containers, protocols, DRM) applies to live streaming too.

The VOD Journey: From Camera to Your Screen

A video goes through six stages to reach your phone:

①Capture/Upload     ②Transcode           ③Package
┌─────────┐        ┌─────────┐         ┌─────────┐
│ Director │ ────►  │ Compress │ ──────► │ Cut into │
│ uploads  │        │ into many│         │ small    │
│ raw file │        │ qualities│         │ segments │
└─────────┘        └─────────┘         └─────────┘
                                             │
     ┌───────────────────────────────────────┘
     │
     ▼
④Store in Cloud     ⑤CDN Distribution    ⑥Playback
┌─────────┐        ┌─────────┐         ┌─────────┐
│ Put into │ ────►  │ Copy to  │ ──────► │ Auto-   │
│ object   │        │ nearest  │         │ select  │
│ storage  │        │ data     │         │ quality │
│ (S3 etc) │        │ center   │         │ & play  │
└─────────┘        └─────────┘         └─────────┘

Each stage maps to a chapter in this series:

Stage	Problem it solves	Series part
①Capture/Upload	How to reliably send large files to the server	Part 11
②Transcode	How to compress a 20 GB master down to 200 MB and still look great	Part 2, Part 3
③Package	How to combine video + audio + subtitles and slice into segments	Part 4, Part 5
④Storage	How to store massive amounts of video cheaply	Part 11, Part 12
⑤CDN	How to make it fast for users worldwide	Part 7
⑥Playback	How to adapt to network speed and prevent piracy	Part 6, Part 8, Part 9

And the thread running through everything — how do you know if users are having a good experience? — is Part 10: QoE Metrics.

A Video Is Just a Stack of Photos

This is the single most important sentence in this chapter:

A video = a sequence of images played rapidly + an audio track.

When you watch a video, your brain sees:

Frame 1   Frame 2   Frame 3   Frame 4   Frame 5  ...
┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐
│    │   │    │   │    │   │    │   │    │
│ 🚗 │   │ 🚗 │   │ 🚗 │   │ 🚗 │   │ 🚗 │
│    │   │    │   │    │   │    │   │    │
└────┘   └────┘   └────┘   └────┘   └────┘
         (car shifts slightly right)
           ┃
           ▼  Play 30 images per second → you see "smooth driving"

Each image is called a frame.

Pixels and Resolution

Zoom into any image far enough and you’ll see tiny squares — each one records a single color. That square is a pixel.

A 1920×1080 image has 1920 columns × 1080 rows = 2,073,600 pixels (~2 megapixels).
Each pixel stores a color value that takes a few bytes.

Resolution is just the pixel dimensions. The common labels:

Label	Resolution	Total pixels	Relative size
240p	426 × 240	~100K	1x (baseline)
360p	640 × 360	~230K	2.3x
480p (SD)	854 × 480	~410K	4.0x
720p (HD)	1280 × 720	~920K	9.0x
1080p (FHD)	1920 × 1080	~2M	20x
1440p (QHD, often called 2K)	2560 × 1440	~3.7M	36x
2160p (4K UHD)	3840 × 2160	~8.3M	81x
4320p (8K)	7680 × 4320	~33.2M	325x
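The pixel counts and ratios follow directly from the dimensions. A minimal sketch that recomputes them (the `RESOLUTIONS` dictionary and the function names are just illustrative):

```python
# Width x height for the common resolution labels listed above.
RESOLUTIONS = {
    "240p": (426, 240),
    "360p": (640, 360),
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "2160p": (3840, 2160),
    "4320p": (7680, 4320),
}


def pixel_count(width, height):
    """Total number of pixels in one frame."""
    return width * height


def relative_to_baseline(label, baseline="240p"):
    """How many times more pixels `label` has than the baseline resolution."""
    return pixel_count(*RESOLUTIONS[label]) / pixel_count(*RESOLUTIONS[baseline])


for label, (w, h) in RESOLUTIONS.items():
    ratio = relative_to_baseline(label)
    print(f"{label:>6}: {w} x {h} = {pixel_count(w, h):>10,} pixels ({ratio:.1f}x)")
```

Going from 1080p to 4K doubles each dimension, so the pixel count (and the raw data per frame) goes up fourfold.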

Note: “4K” has two flavors — UHD 4K (consumer: 3840×2160) and DCI 4K (cinema: 4096×2160).

Portrait mobile video uses a 9:16 ratio (e.g., 720×1280), the inverse of landscape 16:9 (1920×1080).

How Pixels Store Color: RGB, YUV, and Bit Depth

RGB

The most intuitive method: store Red, Green, Blue intensity per pixel.

Black = R:0 G:0 B:0
White = R:255 G:255 B:255
Each channel uses 8 bits (1 byte, 0–255), so one RGB pixel = 3 bytes.

Quick math: a single 1080p RGB frame = 1920 × 1080 × 3 bytes ≈ 6.2 MB. At 30 fps, that’s 186 MB/sec — a 2-hour movie would be 1.3 TB uncompressed!

That’s why video must be compressed.
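Here is the same arithmetic as a small script, in case you want to plug in other resolutions or frame rates. It assumes 3 bytes per pixel (8-bit RGB) and decimal units (1 MB = 1,000,000 bytes); the function names are made up for this example.

```python
def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Size of an uncompressed video (picture only, no audio) in bytes."""
    return raw_frame_bytes(width, height, bytes_per_pixel) * fps * seconds


frame = raw_frame_bytes(1920, 1080)
per_second = frame * 30
movie = raw_video_bytes(1920, 1080, fps=30, seconds=2 * 60 * 60)

print(f"One frame:    {frame / 1e6:.1f} MB")
print(f"One second:   {per_second / 1e6:.1f} MB")
print(f"2-hour movie: {movie / 1e12:.2f} TB")
```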

YUV (The Video Industry Standard)

Video uses YUV (also written YCbCr):

Y (Luma): How bright the pixel is (0 = black, 255 = white)
U, V (Chroma): What color the pixel is

Why not just use RGB? Because:

The human eye is far more sensitive to brightness than to color.

YUV exploits this: you can record less color information with virtually no perceived difference.
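To make the split between brightness and color concrete, here is a rough sketch of an RGB to YCbCr conversion. It assumes the BT.709 luma coefficients and full-range 8-bit values (0–255 for Y, with Cb/Cr centered on 128); real pipelines often use limited range and may use other coefficient sets, so treat this as an illustration rather than a reference implementation.

```python
# BT.709 luma weights: green contributes most to perceived brightness.
KR, KG, KB = 0.2126, 0.7152, 0.0722


def clamp8(value):
    """Round to the nearest integer and clamp into the 8-bit range."""
    return max(0, min(255, round(value)))


def rgb_to_ycbcr(r, g, b):
    """Convert one full-range 8-bit RGB pixel to (Y, Cb, Cr)."""
    y = KR * r + KG * g + KB * b
    cb = 128 + (b - y) / (2 * (1 - KB))  # blue-difference chroma
    cr = 128 + (r - y) / (2 * (1 - KR))  # red-difference chroma
    return clamp8(y), clamp8(cb), clamp8(cr)


for name, rgb in [("black", (0, 0, 0)), ("gray", (128, 128, 128)), ("white", (255, 255, 255))]:
    print(name, rgb_to_ycbcr(*rgb))
```

Black, gray, and white differ only in Y; their chroma stays at the neutral value 128. All of the "how bright" information lives in one channel, which is what lets the other two be stored at lower resolution.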

Chroma Subsampling

Scheme	Description	Data vs 4:4:4	Used in
4:4:4	Full Y/U/V per pixel	100%	Film post-production
4:2:2	Two adjacent pixels share one U/V pair	67%	Broadcast, professional
4:2:0	Four adjacent pixels share one U/V pair	50%	Nearly all consumer streaming

   Luma Y (all kept)          Chroma U/V (one per 2×2 block)
   ┌──┬──┬──┬──┐             ┌─────┬─────┐
   │Y │Y │Y │Y │             │     │     │
   ├──┼──┼──┼──┤             │ UV  │ UV  │
   │Y │Y │Y │Y │             │     │     │
   ├──┼──┼──┼──┤             ├─────┼─────┤
   │Y │Y │Y │Y │             │     │     │
   ├──┼──┼──┼──┤             │ UV  │ UV  │
   │Y │Y │Y │Y │             │     │     │
   └──┴──┴──┴──┘             └─────┴─────┘
   16 Y values                4 UV pairs

   RGB 4:4:4 = 16 × 3 = 48 bytes
   YUV 4:2:0 = 16 + 4 + 4 = 24 bytes (half the data)

The trade-off: sharp red text on a pure black background may show slight color bleeding. For the vast majority of natural scenes, though, the difference is not noticeable.
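The byte counts in the diagram generalize to any frame size. The sketch below assumes 8-bit samples and even dimensions, and it subsamples chroma by averaging each 2×2 block, which is one common choice rather than the only one. All names are illustrative.

```python
# (horizontal divisor, vertical divisor) applied to each chroma plane.
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def frame_bytes(width, height, scheme="4:2:0"):
    """Bytes for one 8-bit frame: a full luma plane plus two chroma planes."""
    h_div, v_div = SUBSAMPLING[scheme]
    luma = width * height
    chroma_plane = (width // h_div) * (height // v_div)
    return luma + 2 * chroma_plane


def subsample_420(plane):
    """Shrink a chroma plane by averaging every 2x2 block into one value."""
    result = []
    for row in range(0, len(plane), 2):
        out_row = []
        for col in range(0, len(plane[row]), 2):
            block = (
                plane[row][col], plane[row][col + 1],
                plane[row + 1][col], plane[row + 1][col + 1],
            )
            out_row.append(sum(block) // 4)
        result.append(out_row)
    return result


for scheme in SUBSAMPLING:
    print(scheme, frame_bytes(4, 4, scheme), "bytes for a 4x4 block")

print(subsample_420([[10, 20], [30, 40]]))
```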

Bit Depth

How many bits per channel:

Bit depth	Range per channel	Colors per pixel	Used in
8-bit	0–255	16.7M	Most consumer streaming
10-bit	0–1023	1.07B	Required for HDR; Netflix 4K, UHD Blu-ray
12-bit	0–4095	68.7B	Film masters, Dolby Vision

8-bit is usually fine, but on smooth gradients (e.g., a blue sky transitioning from deep to light blue), you can get visible banding: unnatural step-like boundaries. HDR content uses at least 10-bit so that the steps become too fine to see.
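The numbers in the table, and the reason banding shows up, can be worked out in a few lines. The gradient below is a made-up example: a sky that spans 20% of the brightness range across a 1920-pixel-wide frame.

```python
def levels_per_channel(bits):
    """Number of distinct values one channel can hold."""
    return 2 ** bits


def colors_per_pixel(bits):
    """Distinct colors with three channels at the given bit depth."""
    return levels_per_channel(bits) ** 3


def band_width_px(width_px, brightness_span, bits):
    """Approximate width of each visible step in a smooth gradient.

    brightness_span is the fraction of the full range the gradient covers
    (0.2 means the gradient only uses 20% of the available codes).
    """
    codes_used = brightness_span * (levels_per_channel(bits) - 1)
    return width_px / codes_used


for bits in (8, 10, 12):
    print(
        f"{bits}-bit: {levels_per_channel(bits):>5} levels, "
        f"{colors_per_pixel(bits):>14,} colors, "
        f"bands ~{band_width_px(1920, 0.2, bits):.1f} px wide"
    )
```

At 8-bit the hypothetical sky is drawn with only about 50 distinct values, so each band is dozens of pixels wide. At 10-bit there are four times as many steps and each one is a quarter as wide.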

Frame Rate (fps)

fps = frames per second.

24 fps: Cinema standard since the 1920s. Gives that “film look.”
25 / 50 fps: PAL television (Europe, China).
29.97 / 30 fps: NTSC (North America, Japan). Default for most phone recordings.
60 fps: Gaming, sports, YouTube high-frame-rate.
120 / 240 fps: Slow motion, professional capture.

Why is 24 fps enough for movies? The brain starts to read a rapid sequence of still images as continuous motion somewhere around 16 fps (the effect loosely called “persistence of vision”). 24 fps was the 1920s sweet spot of “smooth enough + saves the most film stock.” For fast action (sports, gaming), 60 fps or more keeps motion crisp and reduces judder and perceived blur.

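Watch out for 29.97 fps: it’s not a typo. When NTSC added color, the frame rate was lowered from 30 to 30000/1001 ≈ 29.97 fps so that the new color subcarrier would not interfere with the audio carrier, while the signal stayed compatible with existing black-and-white sets.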
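Because these rates are ratios rather than round decimals, tools usually store them as fractions. A small sketch using Python's `fractions` module (the `ntsc_rate` helper is just for illustration):

```python
from fractions import Fraction


def ntsc_rate(nominal_fps):
    """NTSC-style rate: the nominal rate slowed down by a factor of 1000/1001."""
    return Fraction(nominal_fps * 1000, 1001)


for nominal in (30, 60):
    rate = ntsc_rate(nominal)
    print(f"{nominal} fps nominal -> {rate} = {float(rate):.5f} fps")

# Keeping the exact fraction avoids drift when converting frame counts to time.
frame_duration = 1 / ntsc_rate(30)
print("Exact duration of one frame:", frame_duration, "seconds")
```

Rounding to 29.97 and multiplying by a frame count slowly accumulates error over a long video; the exact fraction does not.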

Higher frame rate = larger file, but not proportionally. 60 fps is roughly 1.5–1.7× the size of 30 fps at the same resolution and quality, not 2×, because the extra frames are mostly small predicted frames.

Bitrate: How Much Data Per Second

Bitrate is the number of bits consumed per second of video.

kbps (kilobits/sec): 1 Mbps = 1000 kbps
Mbps (megabits/sec): the common unit

File size ≈ bitrate × duration:

1 Mbps × 60 seconds ÷ 8 (bits to bytes) ≈ 7.5 MB
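The same formula as code, plus its inverse (what average bitrate does a given file size imply?). This sketch uses decimal units, 1 MB = 1,000,000 bytes, and ignores audio and container overhead.

```python
def file_size_mb(bitrate_mbps, seconds):
    """Approximate file size in MB for a given average bitrate and duration."""
    return bitrate_mbps * seconds / 8  # 8 bits per byte


def average_bitrate_mbps(size_mb, seconds):
    """Average bitrate in Mbps implied by a file size and duration."""
    return size_mb * 8 / seconds


print(file_size_mb(1, 60))                        # one minute at 1 Mbps
print(file_size_mb(5, 2 * 60 * 60))               # two hours at 5 Mbps
print(round(average_bitrate_mbps(2000, 2 * 60 * 60), 2))  # a 2 GB, 2-hour movie
```

A 2 GB, 2-hour movie works out to a little over 2 Mbps on average, which is why it streams comfortably on an ordinary connection.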

Typical Bitrates (H.264)

Resolution	Recommended bitrate	1 min file size
240p	0.3–0.5 Mbps	~3 MB
360p	0.5–0.8 Mbps	~5 MB
480p	0.8–1.2 Mbps	~8 MB
720p	1.5–3 Mbps	~15 MB
1080p	3–6 Mbps	~30 MB
4K	15–30 Mbps	~150 MB

CBR / VBR / CRF

Three rate-control modes:

Mode	Meaning	Analogy
CBR (Constant Bitrate)	Fixed bits per second	Always ordering exactly 2 dishes
VBR (Variable Bitrate)	More bits for complex scenes, fewer for simple ones	Big eater orders more, light eater orders less
CRF (Constant Rate Factor)	Quality stays constant, bitrate adapts	No matter what you order, you eat until 80% full

VOD favors VBR or CRF — better quality at the same file size. Live streaming favors CBR — predictable bitrate for stable network transmission.
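As a rough illustration of how the three modes look in practice, the sketch below builds ffmpeg arguments for the libx264 encoder. The flags themselves (`-crf`, `-b:v`, `-minrate`, `-maxrate`, `-bufsize`) are standard ffmpeg options, but the specific values (CRF 23, 4M target, 6M peak, 8M buffer) are placeholder assumptions, not recommendations from this series. Pinning min and max to the target gives approximately constant bitrate, not a strict guarantee.

```python
def rate_control_args(mode, target="4M", peak="6M", bufsize="8M", crf=23):
    """Return illustrative ffmpeg video arguments for a rate-control mode."""
    common = ["-c:v", "libx264"]
    if mode == "crf":
        # Constant quality: no bitrate target, the encoder spends what it needs.
        return common + ["-crf", str(crf)]
    if mode == "vbr":
        # Average target with room to spike on complex scenes.
        return common + ["-b:v", target, "-maxrate", peak, "-bufsize", bufsize]
    if mode == "cbr":
        # Min, target, and max all pinned to the same value.
        return common + ["-b:v", target, "-minrate", target,
                         "-maxrate", target, "-bufsize", bufsize]
    raise ValueError(f"unknown mode: {mode}")


for mode in ("cbr", "vbr", "crf"):
    command = ["ffmpeg", "-i", "input.mp4", *rate_control_args(mode), "output.mp4"]
    print(" ".join(command))
```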

I-Frames, P-Frames, B-Frames: The Core of Video Compression

This is the most critical concept in this chapter. Understand it, and everything in the codec chapter falls into place.

Why Can Video Be Compressed So Aggressively?

Imagine a video of someone sitting on a couch watching TV:

Frame 1: Person on couch, TV playing animation
Frame 2: Person on couch, TV playing animation (TV image changes slightly)
Frame 3: Person blinks, TV playing animation
Frame 4: Person on couch, TV playing animation

Nearly all of the pixels are identical from one frame to the next. Storing every frame in full is a massive waste.

The smart approach:

Occasionally store a “complete snapshot”
The rest of the time, only store “what changed since the last frame”
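Here is a toy version of that idea. It treats a frame as a flat list of pixel values and stores either the full frame or a list of `(position, new_value)` changes. Real codecs do something far more sophisticated (motion-compensated prediction on blocks, then transform coding), so this is only meant to show why "snapshot plus differences" is so much smaller.

```python
def encode(frames, keyframe_interval=4):
    """Store a full frame every `keyframe_interval` frames, diffs otherwise."""
    encoded = []
    previous = None
    for index, frame in enumerate(frames):
        if index % keyframe_interval == 0:
            encoded.append(("I", list(frame)))  # complete snapshot
        else:
            changes = [
                (pos, new) for pos, (old, new) in enumerate(zip(previous, frame))
                if old != new
            ]
            encoded.append(("P", changes))  # only what changed
        previous = frame
    return encoded


def decode(encoded):
    """Rebuild every frame by applying diffs on top of the last snapshot."""
    frames = []
    current = None
    for kind, payload in encoded:
        if kind == "I":
            current = list(payload)
        else:
            current = list(current)
            for pos, value in payload:
                current[pos] = value
        frames.append(current)
    return frames


# Eight "pixels" per frame; only one or two change between frames.
frames = [
    [5, 5, 5, 5, 9, 9, 9, 9],
    [5, 5, 5, 5, 9, 9, 9, 8],
    [5, 5, 5, 5, 9, 9, 7, 8],
    [5, 5, 5, 5, 9, 9, 7, 8],
]
encoded = encode(frames)
for kind, payload in encoded:
    print(kind, payload)
assert decode(encoded) == frames
```

Notice that the last frame, which is identical to the one before it, costs nothing beyond an empty change list.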

Three Frame Types

Type	Full name	Content	Size	Can decode independently?
I-frame (keyframe)	Intra-coded	A complete image (like a JPEG)	Large	Yes
P-frame	Predicted	“Difference from a previous frame”	Small	No (needs the reference frame first)
B-frame	Bi-directionally predicted	“Difference from both previous and next frames”	Smallest	No (needs both reference frames)

Timeline →
 I - P - P - P - B - P - P - B - I - P - P - P ...
 ▲                               ▲
 Keyframe                        Next keyframe
 (appears every N frames)
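One consequence of B-frames needing a later frame: the decoder cannot process frames in the order you see them. The sketch below shows the reordering for the simple case where each B-frame references the nearest I- or P-frame on either side. That is an assumption for illustration; modern encoders can build more elaborate reference structures.

```python
def decode_order(display_order):
    """Reorder frames so every B-frame comes after both of its references."""
    ordered = []
    waiting_b_frames = []
    for frame in display_order:
        if frame.startswith("B"):
            # Cannot decode yet: the following reference frame is still missing.
            waiting_b_frames.append(frame)
        else:
            # An I- or P-frame: decode it, then the B-frames that were waiting.
            ordered.append(frame)
            ordered.extend(waiting_b_frames)
            waiting_b_frames = []
    ordered.extend(waiting_b_frames)
    return ordered


shown = ["I0", "B1", "B2", "P3", "B4", "P5"]
print("display:", shown)
print("decode: ", decode_order(shown))
```

The frames are stored in decode order and carry timestamps so the player can put them back in display order before showing them.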

IDR Frames

An IDR frame (Instantaneous Decoder Refresh) is a special I-frame: no frame after it may reference anything before it. That makes IDR frames “safe start points.” When you seek to the middle of a video, the player jumps to the nearest IDR frame at or before the target and starts decoding from there.
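Finding that start point is a sorted-list lookup. A minimal sketch, assuming the player already has a sorted list of IDR timestamps in seconds (the list and function name here are hypothetical) and always starts from the last IDR at or before the requested time:

```python
from bisect import bisect_right


def seek_start(idr_times, target_seconds):
    """Return the timestamp of the last IDR frame at or before the target."""
    index = bisect_right(idr_times, target_seconds) - 1
    if index < 0:
        raise ValueError("target is before the first IDR frame")
    return idr_times[index]


idr_times = [0.0, 2.0, 4.0, 6.0, 8.0]  # one IDR every 2 seconds

for target in (5.3, 4.0, 1.9):
    start = seek_start(idr_times, target)
    print(f"seek to {target}s -> start decoding at {start}s")
```

The gap between the IDR and the requested time still has to be decoded, just not displayed. That is the cost of seeking into the middle of a long GOP.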

GOP (Group of Pictures)

A GOP is the run of frames that starts at one I-frame and ends just before the next:

 ┌──── GOP 1 ────┐ ┌──── GOP 2 ────┐ ┌──── GOP 3 ...
  I  P  B  P  P  B  I  P  B  P  P  B  I  P ...
                    ▲
                    New IDR starts here

GOP length determines segmentation granularity, because every segment has to start on a keyframe (so a segment is one or more whole GOPs):

Short GOP (1–2 sec): Fine segments, fast seeking and startup; slightly larger files (more I-frames)
Long GOP (4–10 sec): Smaller files, but slower seeking

Short-form video apps typically use short GOPs (1–2s) because users swipe frequently between clips. Feature-length VOD can use longer GOPs to save bandwidth.
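Encoders usually express GOP length in frames, not seconds, so the conversion is worth having on hand. The sketch also checks whether a segment duration lines up with whole GOPs, on the assumption that each segment must begin with a keyframe. Names are illustrative.

```python
from fractions import Fraction


def gop_length_frames(fps, gop_seconds):
    """Number of frames in one GOP at the given frame rate."""
    return round(Fraction(fps) * Fraction(gop_seconds))


def segments_align(segment_seconds, gop_seconds):
    """True if a segment contains a whole number of GOPs."""
    return Fraction(str(segment_seconds)) % Fraction(str(gop_seconds)) == 0


def keyframes_per_hour(gop_seconds):
    """How many keyframes an hour of video contains."""
    return 3600 / gop_seconds


print(gop_length_frames(30, 2), "frames per 2-second GOP at 30 fps")
print(keyframes_per_hour(2), "keyframes per hour with 2 s GOPs")
print(keyframes_per_hour(10), "keyframes per hour with 10 s GOPs")
print("6 s segments, 2 s GOP:", segments_align(6, 2))
print("6 s segments, 4 s GOP:", segments_align(6, 4))
```

Going from 10-second to 2-second GOPs means five times as many of the large I-frames, which is where the "slightly larger files" cost comes from.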

Color Spaces and HDR

Color Spaces

The same numeric RGB values display different actual colors under different standards:

Standard	Used in	Gamut size
sRGB	Web, computers	Baseline
BT.709	HDTV, 1080p streaming	≈ sRGB
BT.2020	HDR, 4K/8K	Roughly twice the area of BT.709 (~76% vs ~36% of the CIE 1931 diagram)
DCI-P3	Cinema, Apple ecosystem	Between BT.709 and BT.2020

HDR: Brighter Brights, Darker Darks, More Colors

Traditional SDR peaks at ~100 nits. HDR reaches 1,000–4,000 nits peak brightness, combined with 10-bit depth + BT.2020 gamut:

Stars in a night sky appear brighter
Shadow details are preserved
Colors are more saturated without clipping

Major HDR formats:

Format	By	Key feature
HDR10	Consumer Technology Association (CTA)	Open and royalty-free; static metadata for the whole title
HDR10+	Samsung / Amazon	Dynamic metadata per scene
Dolby Vision	Dolby	Up to 12-bit, dynamic metadata; often considered the highest quality; license required
HLG	BBC / NHK	Compatible with SDR displays; preferred for broadcast

Be aware: HDR video on an SDR display won’t magically look better. Without tone mapping, it looks washed out and gray.

Hands-On: Inspect a Video with ffprobe

# macOS (Homebrew)
brew install ffmpeg
# Debian / Ubuntu
sudo apt install ffmpeg

# Inspect a video
ffprobe -v error -show_streams -select_streams v:0 myvideo.mp4

Typical output:

codec_name=h264            # Codec (H.264) — see Part 2
profile=High               # Encoding profile
width=1920
height=1080                # Resolution: 1080p
r_frame_rate=30000/1001    # Frame rate: 29.97 fps
pix_fmt=yuv420p            # Pixel format: YUV 4:2:0, 8-bit
color_space=bt709          # Color space: BT.709 (SDR)
bit_rate=4500000           # Bitrate: 4.5 Mbps

After reading this chapter, you should understand every line.
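If you want to pull these fields into a script, the `key=value` lines are easy to parse. The sketch below works on a pasted copy of the output above instead of calling ffprobe, so it runs without ffmpeg installed; `parse_key_values` and `summarize` are illustrative helpers, not part of any library.

```python
from fractions import Fraction

SAMPLE_OUTPUT = """\
codec_name=h264
profile=High
width=1920
height=1080
r_frame_rate=30000/1001
pix_fmt=yuv420p
color_space=bt709
bit_rate=4500000
"""


def parse_key_values(text):
    """Turn ffprobe-style key=value lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line or line.startswith("["):
            continue  # skip blank lines and [STREAM] / [/STREAM] markers
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


def summarize(fields):
    """Convert the raw strings into the units used in this chapter."""
    fps = Fraction(fields["r_frame_rate"])  # parses "30000/1001" exactly
    return {
        "codec": fields["codec_name"],
        "resolution": f'{fields["width"]}x{fields["height"]}',
        "fps": round(float(fps), 2),
        "pixel_format": fields["pix_fmt"],
        "bitrate_mbps": int(fields["bit_rate"]) / 1_000_000,
    }


for key, value in summarize(parse_key_values(SAMPLE_OUTPUT)).items():
    print(f"{key}: {value}")
```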

Key Takeaways

Video = a sequence of images + audio. Each image is a frame.
Each frame is made of pixels; resolution is the pixel dimensions.
The video world uses YUV 4:2:0 (half the data of RGB, with a difference that is rarely noticeable).
Bit depth: 8-bit is standard; HDR requires 10-bit.
Frame rate: 24 fps (cinema) / 30 fps (TV) / 60 fps (gaming/sports).
Bitrate = data per second. VBR/CRF is preferred for VOD.
I/P/B frames are a big part of how video achieves compression ratios in the hundreds (1.3 TB of raw 1080p down to a 2 GB movie).
GOP = the group between keyframes. Short-form video uses short GOPs (1–2s).
HDR = 10-bit + wider gamut + higher brightness — fundamentally different from SDR.

Three Pairs of Concepts You’ll See Everywhere

Before diving deeper, pin these down:

Codec ≠ Container — H.264 is a compression algorithm (codec); MP4 is a file format (container). An .mp4 file can hold H.264 or H.265 or AV1.
Protocol ≠ Packaging — HLS and DASH are “how to deliver” rules (protocols); fMP4 and TS are “how to slice and wrap” formats (packaging).
Encryption ≠ DRM — HLS AES-128 is lightweight encryption (key leaks = game over). DRM is an entire system: key distribution + device restrictions + output protection.

All three pairs are covered in detail throughout this series.

Next up: How does a 4K movie fit in 5 GB? → Part 2: Video Codecs — H.264, H.265, and AV1