VOD Deep Dive Part 4: Container Formats — .mp4 Is Not a Codec

Containers vs codecs, MP4 internals (Box structure), the faststart trap, fragmented MP4, CMAF for unified HLS+DASH, segment length trade-offs, and subtitle formats.

zhuermu · · 18 min
vodstreamingmp4fmp4cmafcontainer

This is Part 4 of the VOD Streaming Deep Dive series.


The Most Common Misconception

Here’s a mistake every beginner makes:

❌ “The video codec is MP4.”

✅ “The video container is MP4. The codec is H.264 (or H.265, AV1, etc.).”

MP4 is a container — a box. The box can hold any supported codec stream.

┌──────────────────────────────────────┐
│   MP4 Container (a .mp4 file)        │
│  ┌────────────────────────────────┐  │
│  │ Video stream: H.264 / H.265 / AV1│
│  └────────────────────────────────┘  │
│  ┌────────────────────────────────┐  │
│  │ Audio stream: AAC / Opus / AC-3 │  │
│  └────────────────────────────────┘  │
│  ┌────────────────────────────────┐  │
│  │ Subtitle stream: TTML / WebVTT  │  │
│  └────────────────────────────────┘  │
│  ┌────────────────────────────────┐  │
│  │ Metadata: title, duration, index│  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘

Think of MP4 as a shipping box. H.264 is the product inside. The same box can hold a phone (H.264), a shirt (H.265), or fruit (AV1).


Why Do We Need Containers?

Can’t we just store raw encoded bitstreams?

No. Raw bitstreams lack:

  • How to synchronize video and audio playback
  • How to align subtitles
  • Where to start reading when the user seeks to 1:00
  • How many audio tracks exist
  • What codec is used, with what parameters

Containers organize all of this so the player can read data along a timeline.


Major Container Formats

ContainerExtensionCreatorTypical contentsBest for
MP4 / fMP4.mp4 .m4s .m4aMPEGH.264/H.265/AV1 + AACMost universal, streaming default
MOV.movAppleNearly anythingmacOS production, mezzanine transfer
MKV.mkvCoreCodecNearly anythingHD downloads, open-source community
WebM.webmGoogleVP9/AV1 + OpusWeb (non-iOS)
MPEG-TS.tsMPEGH.264/HEVC + AAC/AC-3Broadcast, legacy HLS
FLV.flvAdobeH.264 + AACRetired (RTMP legacy)

For VOD, you realistically only need two: MP4/fMP4 (streaming) and MOV (mezzanine transfer).


MP4 / ISO BMFF Internals

MP4’s formal standard is ISO Base Media File Format (ISO BMFF), ISO/IEC 14496-12. Internally, it’s composed of nested units called Boxes (also called atoms):

File start

├── [ftyp]  File Type Box     ← Tells the player "this is MP4"

├── [moov]  Movie Box          ← Metadata "table of contents":
│    ├── [mvhd]  Movie Header (total duration, timescale)
│    ├── [trak]  Track Box (one per audio/video/subtitle track)
│    │    ├── [tkhd] Track Header
│    │    └── [mdia] Media
│    │         ├── [mdhd] Media Header
│    │         ├── [hdlr] Handler (vide/soun/subt)
│    │         └── [minf]
│    │              └── [stbl] Sample Table  ← byte offset of every frame
│    └── [trak] ... (more tracks)

└── [mdat]  Media Data Box     ← The actual audio/video binary data
      └── (large binary blob)

Two key concepts:

  • moov: The metadata/directory. Tells the player “frame 1 is at byte 12345, frame 2 is at byte 13800…”
  • mdat: The actual data. Pure H.264 + AAC binary.

The Faststart Trap

By default, MP4 encoders place the moov box at the end of the file (because the full timestamp index isn’t known until encoding completes).

Default MP4:
┌─────────┬─────────────────────────────────┬──────┐
│  ftyp   │         mdat (99.9%)            │ moov │
└─────────┴─────────────────────────────────┴──────┘
   8 bytes         hundreds of MB/GB          tens of KB

The problem: a web player must read moov before it can play, but moov is at the end → the entire file must download before playback begins.

Solution: Faststart (moov at the front)

Faststart MP4:
┌─────────┬──────┬──────────────────────────────┐
│  ftyp   │ moov │           mdat               │
└─────────┴──────┴──────────────────────────────┘

          After reading this (< 1 MB), playback can begin
ffmpeg -i input.mov -c copy -movflags +faststart output.mp4

VOD rule: always apply faststart before publishing. Without it, users see a long blank screen after pressing play.


Fragmented MP4 (fMP4): The Streaming Choice

Traditional MP4 has a problem: the moov box describes the entire file. For a multi-hour movie, moov grows to megabytes — slow to parse and expensive to update.

fMP4 (Fragmented MP4) splits the file into many small fragments, each with its own mini-moov (called moof):

fMP4 structure:

┌──────┬──────┐  ┌──────┬──────┐  ┌──────┬──────┐  ┌──────┬──────┐
│ moov │  -   │  │ moof │ mdat │  │ moof │ mdat │  │ moof │ mdat │
│ init │      │  │ frag1│frag1 │  │ frag2│frag2 │  │ frag3│frag3 │
└──────┴──────┘  └──────┴──────┘  └──────┴──────┘  └──────┴──────┘
  Standalone                                               
  init segment          Each fragment is independently decodable

Two core benefits:

  1. Independent segments: Each moof+mdat pair can be stored as a separate file (.m4s). Exactly what streaming protocols need.
  2. Startup only requires a tiny init segment (~tens of KB), not the entire moov.

Modern HLS, DASH, and CMAF are all based on fMP4. MPEG-TS (legacy HLS) is being phased out.


CMAF: One File to Rule HLS and DASH

Historically, Apple pushed HLS (with TS segments) and the rest of the industry pushed DASH (with fMP4 segments). Same video, stored twice.

CMAF (Common Media Application Format), standardized in 2018, fixes this:

HLS and DASH share the same fMP4 files. Only the manifest differs.

One set of CMAF fMP4 segments:
┌──────┐  ┌───────┐ ┌───────┐ ┌───────┐
│ init │  │ seg1  │ │ seg2  │ │ seg3  │  ← Only one copy on disk
└──────┘  └───────┘ └───────┘ └───────┘
            ↑              ↑
   ┌────────┴───────┐   ┌──┴──────────┐
   │ HLS master.m3u8│   │ DASH .mpd   │
   │ points to same │   │ points to   │
   │ segments       │   │ same segments│
   └────────────────┘   └─────────────┘

Benefits: storage halved, CDN cache hit rate doubled, transcode once.

Trade-off: all platforms must support CMAF (modern ones do) and DRM must use CBCS mode (compatible with FairPlay).

For projects started after 2020: use CMAF directly. Don’t do “HLS TS + DASH fMP4” dual publishing.


Segment Length: How Long Should Each Slice Be?

Segment lengthProsConsBest for
1–2 secFast startup, quick ABR adaptationMany files, many HTTP requestsShort-form video, low-latency live
2–4 secBalancedHLS/DASH recommended default (4s)
6–10 secFewer HTTP requests, CDN-friendlySlow startup, coarse seekingLong movies, traditional broadcast

Short-form video apps should use 2-second segments (users swipe frequently). Long-form VOD should use 4–6 seconds.

Segment length must be an integer multiple of the GOP duration — see Part 1 on keyframes.


Subtitles in Containers

Three approaches:

Sidecar (External)

Subtitles as separate files:

video.mp4
video.en.vtt   ← English subtitles
video.zh.vtt   ← Chinese subtitles

The HLS/DASH manifest references these files. Easy to add languages; changing subtitles doesn’t require re-encoding video.

Embedded

Subtitles as a track inside the MP4. One file contains everything.

Burned-In (Hardcoded)

Subtitles rendered directly into the video pixels. Cannot be turned off.

Recommendation: VOD with multi-language support → Sidecar WebVTT. Short-form video where subtitles are part of the creative → burned-in.

Common Subtitle Formats

FormatKey featureUsed in
SRTSimplest: text + timestampsUniversal
WebVTTSRT enhanced (styling, positioning)HTML5 / HLS standard
TTML / IMSC1XML, complex layoutDASH, broadcast
ASS / SSAPowerful styling, animationAnime community

Hands-On: Container Operations with ffmpeg

Inspect what’s inside an MP4

ffprobe -v error -show_streams input.mp4

Lists all video / audio / subtitle streams.

Convert .mov to fMP4 for streaming (no re-encoding)

ffmpeg -i input.mov \
  -c copy \
  -movflags +faststart+frag_keyframe+empty_moov+default_base_moof \
  output_fragmented.mp4

Slice into HLS segments (fMP4 format)

ffmpeg -i input.mp4 \
  -c:v libx264 -preset slow -crf 22 -g 60 -keyint_min 60 -sc_threshold 0 \
  -c:a aac -b:a 128k \
  -f hls \
  -hls_time 4 \
  -hls_segment_type fmp4 \
  -hls_playlist_type vod \
  -hls_list_size 0 \
  -hls_segment_filename "seg_%04d.m4s" \
  output.m3u8

Output:

output.m3u8      ← HLS manifest
init.mp4         ← CMAF init segment
seg_0000.m4s
seg_0001.m4s
seg_0002.m4s
...

This is a minimal working HLS stream. Host it on any web server (even python3 -m http.server) and play it with Safari or hls.js.


Key Takeaways

  1. Container ≠ Codec. MP4 is a container; H.264 is a codec.
  2. MP4 internals are Box-based: moov is the directory, mdat is the data.
  3. Always apply -movflags +faststart for VOD — moves moov to the front for progressive playback.
  4. fMP4 splits the file into independent fragments — the foundation of modern streaming.
  5. CMAF lets HLS and DASH share one set of fMP4 files: storage halved, cache doubled.
  6. TS is legacy. New projects should use fMP4/CMAF.
  7. Segment length: short-form video → 2s; long-form VOD → 4–6s.
  8. Subtitles: prefer sidecar WebVTT for multi-language.

Previous: Part 3: Audio Fundamentals

Next: Part 5: Streaming Protocols — HLS and DASH