VOD Deep Dive Part 10: QoE Metrics — How to Measure What Users Actually Feel
QoE vs QoS, six core metrics (VST, RBR, VSF, EBVS, VPF, Avg Bitrate), data pipelines, multi-dimensional drill-down, troubleshooting cases, and when to buy vs build.
This is Part 10 of the VOD Streaming Deep Dive series.
QoE vs QoS: Two Often-Confused Terms
| Abbreviation | Full name | Definition | Perspective |
|---|---|---|---|
| QoS | Quality of Service | Objective network/infrastructure performance (bandwidth, packet loss, latency) | Network engineering / Ops |
| QoE | Quality of Experience | User-perceived quality | Product / User research |
QoS says “I’m delivering 10 Mbps.” QoE says “Can users actually watch without buffering?”
The same QoS can produce very different QoE depending on player implementation, ABR strategy, and encoding parameters.
We care about QoE. Running a video platform without QoE data is like operating a highway system without measuring traffic congestion.
Six Core QoE Metrics
1. Video Startup Time (VST)
Definition: Time from the user pressing play to the first frame being rendered.
Targets:
- Short-form mobile video: P50 < 300ms, P95 < 800ms
- Long-form VOD: P50 < 1s, P95 < 2s
The most sensitive metric. Users won’t tolerate a 2-second black screen.
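As a minimal sketch of how the P50/P95 targets above might be checked, the snippet below computes nearest-rank percentiles over a list of startup samples. The function name `percentile` and the sample values are illustrative assumptions; a production pipeline would more likely use the percentile functions built into its data warehouse.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical startup times (ms) for ten short-form sessions.
vst_ms = [180, 220, 250, 260, 290, 310, 400, 520, 700, 1500]

p50 = percentile(vst_ms, 50)
p95 = percentile(vst_ms, 95)

# Compare against the short-form targets quoted above.
meets_p50_target = p50 < 300
meets_p95_target = p95 < 800
```

With these made-up samples the median is within target while the tail is not, which is why the article tracks both percentiles rather than a single average.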
2. Rebuffering Ratio (RBR)
Definition: rebuffer_time / (rebuffer_time + play_time)
Example: User watched 60 seconds, buffered for 3 seconds total. RBR = 3/(60+3) = 4.8%.
Target: < 0.5% (excellent), < 1% (acceptable).
Impacts retention directly: Conviva research shows that every 1% increase in RBR reduces watch time by 2–5%.
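The RBR formula and the worked example above can be reproduced in a few lines. This is a sketch: the function names are hypothetical, and the choice to aggregate by summing times across sessions (rather than averaging per-session ratios) is an assumption, since the article does not specify one.

```python
def rebuffering_ratio(rebuffer_sec, play_sec):
    """RBR = rebuffer_time / (rebuffer_time + play_time)."""
    total = rebuffer_sec + play_sec
    if total <= 0:
        return 0.0  # no playback at all: treat as zero rather than divide by zero
    return rebuffer_sec / total

def aggregate_rbr(sessions):
    """Fleet-level RBR from (rebuffer_sec, play_sec) pairs.

    Summing before dividing weights long sessions more heavily than
    short ones; this is one possible convention, not the only one.
    """
    rebuffer = sum(r for r, _ in sessions)
    play = sum(p for _, p in sessions)
    return rebuffering_ratio(rebuffer, play)

# The example from the text: 60 s watched, 3 s buffering.
rbr = rebuffering_ratio(3, 60)
rbr_percent = round(rbr * 100, 1)
```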
3. Video Start Failure (VSF)
Definition: User triggered playback but the first frame never rendered (due to 404, CORS, DRM error, etc.).
Target: < 1%
Conviva further subdivides this:
- VSF-T (Technical): Failed due to technical reasons — counts as QoE failure
- VSF-B (Business): Failed for business reasons (no subscription, geo-blocked) — excluded from QoE
4. Exit Before Video Start (EBVS)
Definition: User triggered playback but left voluntarily before the first frame — not an error, just impatience.
Target: < 3%
Strongly correlated with VST. Slow startup → high EBVS → poor retention.
5. Video Playback Failure (VPF)
Definition: Playback crashed mid-stream (decode error, expired certificate, CDN stream cut).
Target: < 0.5%
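The distinctions between VSF-T, VSF-B, EBVS, and VPF come down to two questions: did the first frame render, and was an error reported? The sketch below encodes that decision. The `Session` fields and the error code strings are hypothetical; a real SDK would have its own error taxonomy.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical error codes treated as business (not technical) failures.
BUSINESS_ERRORS = {"no_subscription", "geo_blocked"}

@dataclass
class Session:
    first_frame_rendered: bool
    error_code: Optional[str] = None  # None means no error was reported

def classify(session):
    """Map one play attempt to a QoE outcome label."""
    if not session.first_frame_rendered:
        if session.error_code is None:
            return "EBVS"  # user left before start, no error involved
        if session.error_code in BUSINESS_ERRORS:
            return "VSF-B"  # excluded from QoE
        return "VSF-T"
    if session.error_code is not None:
        return "VPF"  # started fine, then failed mid-stream
    return "OK"

attempts = [
    Session(True),
    Session(False),
    Session(False, "drm_error"),
    Session(False, "geo_blocked"),
    Session(True, "decode_error"),
]
outcomes = Counter(classify(s) for s in attempts)
```

Which denominator to use for each rate (all attempts, or attempts excluding VSF-B) is a reporting decision the sketch leaves open.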
6. Average Bitrate
Definition: Time-weighted average of bitrate tiers during actual playback.
Purpose: Measures whether users are actually seeing acceptable quality. If 70% of users average 480p, it could mean:
- Network conditions are generally poor
- ABR algorithm is too conservative
- High tiers weren’t transcoded
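"Time-weighted" means each bitrate tier counts in proportion to how long it was actually on screen. A minimal sketch, assuming the player can report `(bitrate_bps, seconds_played)` pairs (that input shape is an assumption, not something the article defines):

```python
def average_bitrate(segments):
    """Time-weighted mean bitrate over (bitrate_bps, seconds_played) pairs."""
    total_sec = sum(sec for _, sec in segments)
    if total_sec == 0:
        return 0.0
    return sum(bps * sec for bps, sec in segments) / total_sec

# Hypothetical session: 30 s at 800 kbps, 60 s at 2.5 Mbps, 10 s at 5 Mbps.
session = [(800_000, 30), (2_500_000, 60), (5_000_000, 10)]
avg_bps = average_bitrate(session)
```

A plain mean of the three tiers would give about 2.77 Mbps; the time-weighted result is 2.24 Mbps because the session spent little time on the top tier.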
Supporting Metrics
| Metric | Description |
|---|---|
| Rebuffer Frequency | Rebuffer events per minute of playback (target: < 0.1/min) |
| Bitrate Switching | Number and magnitude of quality switches (stability preferred) |
| Video Complete Rate | Percentage of users who finish the video (business metric) |
| Time to Key Decode | Time to acquire DRM license |
| First Byte Time | Time until first byte arrives from CDN |
Conviva’s SPI: A Composite Index
SPI (Streaming Performance Index): Conviva’s composite KPI representing the percentage of sessions with “good or very good” experience.
A session qualifies as “good” when it simultaneously meets:
- No VSF-T or VPF-T errors (technical start or playback failures)
- No or minimal rebuffering (CIRR, Conviva's Connection Induced Rebuffering Ratio, below threshold)
- Average bitrate meets the screen-size quality bar
- Video Start Time within acceptable range
- User didn’t wait excessively before exit
Single metrics can mislead (e.g., low RBR but extremely low bitrate). SPI provides a holistic view.
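To make the "all conditions at once" idea concrete, here is a rough sketch of an SPI-style score. It is not Conviva's implementation: the thresholds are placeholders borrowed from this article's own targets, the field names are hypothetical, and a session that never started is simply counted as not good.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder thresholds taken from targets quoted in this article.
# They are NOT Conviva's published SPI thresholds.
MAX_RBR = 0.01
MAX_STARTUP_MS = 2000

@dataclass
class SessionSummary:
    technical_failure: bool      # any VSF-T or VPF-T
    rebuffer_ratio: float
    avg_bitrate_bps: int
    min_bitrate_bps: int         # assumed quality bar for this screen size
    startup_ms: Optional[int]    # None if the first frame never rendered

def is_good(s):
    """A session is good only if every condition holds simultaneously."""
    return (
        not s.technical_failure
        and s.startup_ms is not None
        and s.startup_ms <= MAX_STARTUP_MS
        and s.rebuffer_ratio <= MAX_RBR
        and s.avg_bitrate_bps >= s.min_bitrate_bps
    )

def spi(sessions):
    """Percentage of sessions that qualify as good."""
    if not sessions:
        return 0.0
    return 100.0 * sum(is_good(s) for s in sessions) / len(sessions)

sessions = [
    SessionSummary(False, 0.002, 3_000_000, 1_500_000, 600),
    SessionSummary(False, 0.001, 400_000, 1_500_000, 500),   # low RBR, low bitrate
    SessionSummary(True, 0.0, 0, 1_500_000, None),           # technical failure
    SessionSummary(False, 0.0, 2_000_000, 1_500_000, 900),
]
score = spi(sessions)
```

The second session shows the case the text warns about: its RBR looks excellent, but the bitrate falls below the quality bar, so it does not count as good.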
Multi-Dimensional Drill-Down
Never look at just “overall RBR.” Always slice by dimensions:
| Dimension | Examples |
|---|---|
| Geography | Country / State / City / ISP |
| Device | OS version, model, chipset, screen size |
| Network | WiFi / 4G / 5G, throughput range |
| CDN | Provider, PoP, Shield |
| Content | Title, resolution tier, codec, duration |
| Time | Hour, day, week |
| User | New/returning, paid/free, region |
Standard troubleshooting pattern:
Overall RBR spiked to 2% → cause unknown
↓ Drill by CDN → CDN-A RBR 5%, CDN-B RBR 0.3%
↓ Drill CDN-A by region → Mumbai RBR 12%
↓ Drill Mumbai by time → 19:00-22:00 peak spike
→ Conclusion: CDN-A Mumbai PoP degraded during evening peak
→ Action: Route India traffic to CDN-B
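The drill-down pattern above is essentially a repeated group-by on different dimensions. A small sketch in plain Python, assuming each session record is a dict with dimension fields plus `rebuffer_sec` and `play_sec` (these field names and the sample data are made up for illustration):

```python
from collections import defaultdict

def rbr_by(sessions, *dims):
    """Aggregate RBR per combination of the given dimension fields."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [rebuffer_sec, play_sec]
    for s in sessions:
        key = tuple(s[d] for d in dims)
        totals[key][0] += s["rebuffer_sec"]
        totals[key][1] += s["play_sec"]
    return {
        key: rebuf / (rebuf + play)
        for key, (rebuf, play) in totals.items()
        if rebuf + play > 0
    }

sessions = [
    {"cdn": "CDN-A", "city": "Mumbai", "rebuffer_sec": 12.0, "play_sec": 88.0},
    {"cdn": "CDN-A", "city": "Delhi", "rebuffer_sec": 1.0, "play_sec": 99.0},
    {"cdn": "CDN-B", "city": "Mumbai", "rebuffer_sec": 0.5, "play_sec": 99.5},
]

by_cdn = rbr_by(sessions, "cdn")                 # first drill: which CDN?
worst_cdn = max(by_cdn, key=by_cdn.get)
by_cdn_city = rbr_by(sessions, "cdn", "city")    # second drill: which region?
```

In a real pipeline the same slicing would normally be a `GROUP BY` in the warehouse; the point here is only that each drill step adds one dimension to the key.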
Data Pipeline: From Client to Dashboard
Typical Architecture
┌────────────────┐
│ Client SDK     │
│ (iOS/Android/  │── HTTPS batch POST every 5-10s
│ Web)           │   (events JSON)
└────────────────┘
        │
        ▼
┌────────────────┐
│ Ingestion      │  nginx / ALB / API Gateway / CloudFront
│ (Edge)         │  with rate limiting + auth
└────────────────┘
        │
        ▼
┌────────────────┐
│ Kafka          │  Persistent message queue
│ Topic: qoe     │  Partitioned by day
└────────────────┘
        │
        ├─────────────────────┐
        ▼                     ▼
┌────────────────┐    ┌────────────────┐
│ Flink / Spark  │    │ ClickHouse /   │  Real-time data warehouse
│ Streaming      │    │ BigQuery       │
│ (aggregation)  │    │                │
└────────────────┘    └────────────────┘
        │                     │
        ▼                     ▼
┌────────────────┐    ┌────────────────┐
│ Alerting       │    │ BI Dashboard   │  Grafana, Tableau, Looker
│ (PagerDuty)    │    │                │
└────────────────┘    └────────────────┘
Event Schema
Every event includes:
{
"event": "video_rebuffer_start",
"session_id": "uuid-...",
"user_id": "u-12345",
"video_id": "ep-789",
"timestamp": 1715084800123,
"player_version": "2.3.4",
"device": {
"os": "iOS",
"os_version": "17.4",
"model": "iPhone 15 Pro"
},
"network": {
"type": "cellular",
"carrier": "Verizon",
"effective_type": "4g"
},
"cdn": "cloudfront",
"bitrate": 2500000,
"buffer_level_sec": 0.8,
"position_sec": 45.2
}
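On the ingestion side, it can help to reject malformed events before they reach the queue. The following is a minimal sketch of such a check. Which fields are treated as required is an assumption made for illustration; the article only shows an example payload, not a formal schema.

```python
import json

# Assumed minimum set of fields and their JSON types (illustrative only).
REQUIRED_FIELDS = {
    "event": str,
    "session_id": str,
    "video_id": str,
    "timestamp": int,  # epoch milliseconds, as in the example payload
}

def validate_event(raw):
    """Return a list of problems found in one JSON-encoded event."""
    event = json.loads(raw)
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems

good = json.dumps({
    "event": "video_rebuffer_start",
    "session_id": "uuid-...",
    "video_id": "ep-789",
    "timestamp": 1715084800123,
})
bad = json.dumps({"event": "video_rebuffer_start", "video_id": "ep-789",
                  "timestamp": "1715084800123"})
```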
Batch vs Real-Time
Don’t send one HTTP request per event (100K DAU x 100 events/user = 10M requests/day).
Batch strategy: Accumulate 10 seconds or 50 events, then send one POST.
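A sketch of the "10 seconds or 50 events" rule is below. The class name, the `send` callback, and the injectable `clock` are all assumptions for illustration. Note one simplification: the age limit is only checked when a new event arrives, whereas a real SDK would also flush from a timer and on app background.

```python
import time

class EventBatcher:
    """Buffers events and flushes on a count or age limit, whichever comes first."""

    def __init__(self, send, max_events=50, max_age_sec=10.0, clock=time.monotonic):
        self._send = send              # callable that takes a list of events
        self._max_events = max_events
        self._max_age_sec = max_age_sec
        self._clock = clock
        self._buffer = []
        self._first_at = 0.0           # arrival time of the oldest buffered event

    def add(self, event):
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(event)
        too_many = len(self._buffer) >= self._max_events
        too_old = self._clock() - self._first_at >= self._max_age_sec
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)          # one POST per batch in a real client
```

Usage would look like `batcher = EventBatcher(send=post_batch)` followed by `batcher.add(event)` from the player callbacks, where `post_batch` is whatever function performs the HTTPS POST.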
Real-World Troubleshooting Cases
Case 1: Overall VST Spike
Monday 9 AM: VST P50 jumped from 400ms to 1.2s
│
├── Drill by OS → Android VST spiked to 2s, iOS normal
│
├── Drill by app version → v3.4.5 all 2s, v3.4.4 normal
│
├── Check changelog → v3.4.5 introduced a new player library
│
└── Action: Emergency hotfix / rollback to v3.4.4
Case 2: Single Title Rebuffering
New show Episode 3: RBR anomalously high at 5%
│
├── Drill by CDN → All CDNs high (not a CDN issue)
│
├── Check segments → One segment is 20 MB (others are 2 MB)
│
├── Check encoding log → 10-second action scene caused bitrate spike
│
└── Fix: Re-transcode with MaxBitrate cap on peak bitrate
Case 3: Regional Conversion Drop
India new-user first-hour completion rate dropped from 30% to 15%
│
├── Drill by VST → India VST P50 rose from 0.8s to 3s
│
├── Drill by CDN → CDN-A edge node latency elevated in India
│
├── Ping test → CDN-A Mumbai PoP latency 400ms for 4 hours
│
└── Action: Route India traffic to CDN-B, escalate to CDN-A support
Build vs Buy: Mux, Conviva, or Self-Built?
Managed Services
| Service | Strengths |
|---|---|
| Mux | Developer-friendly, simple integration, ~$1.25/1K sessions |
| Conviva | Enterprise-grade, most comprehensive, most expensive |
| Datadog RUM | Integrated APM in one platform |
| NPAW (YOUBORA) | Strong in European markets |
Pros: Hours to integrate, dashboards out of the box, zero maintenance.
Cons: Expensive at scale, data lives with third party, limited customization.
Self-Built
Pros: Full customization, data can be joined with business metrics (orders, retention), cost advantage at scale.
Cons: High development/maintenance cost, multi-platform SDK consistency is hard.
Common Evolution
- Early stage: Buy Mux — get usable dashboards fast
- At scale: Self-built pipeline + keep Mux as a benchmark for comparison
Client SDK Best Practices
Don’t Slow Down Playback
The QoE SDK itself must not degrade the experience:
- Report on a separate low-priority thread
- Network failures: silent retry, never block UI
- SDK crash must not bring down the app
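One way to keep reporting off the playback path is a bounded queue drained by a background thread. The sketch below shows the isolation part only: it never blocks the caller and swallows reporting errors. Retry logic is omitted, Python threads have no priority setting, and the class and method names are hypothetical.

```python
import queue
import threading

class BackgroundReporter:
    """Hands events to a worker thread so the caller never waits on I/O."""

    def __init__(self, send, max_pending=1000):
        self._send = send
        self._pending = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, event):
        """Called from the playback path: never blocks, never raises."""
        try:
            self._pending.put_nowait(event)
        except queue.Full:
            pass  # drop the event rather than stall playback

    def _run(self):
        while True:
            event = self._pending.get()
            try:
                self._send(event)
            except Exception:
                pass  # a reporting failure must not crash the app
            finally:
                self._pending.task_done()

    def wait_idle(self):
        """Block until everything queued so far has been processed."""
        self._pending.join()
```

Dropping events when the queue is full is a deliberate trade-off in this sketch: losing a few telemetry points is preferable to degrading the very experience being measured.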
Offline Compensation
Users may finish watching offline. When back online:
- Events written to local SQLite/file during offline
- Batch-upload in FIFO order when connectivity returns
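The offline buffer described above can be sketched with the standard-library `sqlite3` module. The table layout, class name, and the convention that `upload` returns `True` on success are assumptions for illustration.

```python
import json
import sqlite3

class OfflineQueue:
    """Persists events locally and uploads them oldest-first."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self._db.commit()

    def enqueue(self, event):
        self._db.execute(
            "INSERT INTO events (payload) VALUES (?)", (json.dumps(event),)
        )
        self._db.commit()

    def drain(self, upload, batch_size=50):
        """Upload in FIFO batches; a batch is deleted only after upload succeeds.

        Returns the number of events uploaded. Stops at the first failure
        so the remaining events are kept for the next attempt.
        """
        uploaded = 0
        while True:
            rows = self._db.execute(
                "SELECT id, payload FROM events ORDER BY id LIMIT ?", (batch_size,)
            ).fetchall()
            if not rows:
                return uploaded
            if not upload([json.loads(payload) for _, payload in rows]):
                return uploaded
            self._db.execute("DELETE FROM events WHERE id <= ?", (rows[-1][0],))
            self._db.commit()
            uploaded += len(rows)
```

Deleting only after a successful upload means a crash between the two steps can cause a duplicate batch, so the server side would need to tolerate or deduplicate repeated events.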
Clock Alignment
Device clocks can be inaccurate:
- Use server timestamps (HTTP `Date` header) as reference
- Events carry relative time (`delta_ms` from session start)
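A minimal sketch of this alignment: estimate the device's clock offset once from a server response, then reconstruct each event's server-side time from its relative offset. The function names are hypothetical. The HTTP `Date` header has one-second resolution and this ignores network latency, so the result is approximate.

```python
from email.utils import parsedate_to_datetime

def clock_offset_ms(date_header, device_now_ms):
    """Server time minus device time, in milliseconds.

    date_header is the value of an HTTP Date response header;
    device_now_ms is the device clock when the response arrived.
    """
    server_ms = int(parsedate_to_datetime(date_header).timestamp() * 1000)
    return server_ms - device_now_ms

def to_server_time_ms(session_start_device_ms, delta_ms, offset_ms):
    """Convert an event's relative time into an absolute server-aligned time."""
    return session_start_device_ms + offset_ms + delta_ms

# Hypothetical device whose clock runs 90 seconds behind the server.
offset = clock_offset_ms("Tue, 07 May 2024 12:26:40 GMT", 1715084710000)
event_time = to_server_time_ms(1715084710000, 4500, offset)
```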
Sampling at Scale
At massive scale, 100% reporting is too expensive:
- Critical error events: Always 100% reported
- Normal events: Sample at 10–30%
- Hash by user_id to ensure all-or-nothing per user (preserves session analysis)
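Hash-based sampling can be sketched as follows. Hashing the `user_id` makes the decision deterministic, so a given user is either fully in or fully out of the sample. The event names in `ALWAYS_REPORT` and the default rate are placeholders, not values from any specific SDK.

```python
import hashlib

# Hypothetical names for critical events that bypass sampling.
ALWAYS_REPORT = {"video_start_failure", "video_playback_failure"}

def user_in_sample(user_id, rate):
    """Deterministically place a user in or out of a sample of the given rate."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)

def should_report(event_name, user_id, rate=0.2):
    """Critical events always go out; everything else follows the user's sample."""
    return event_name in ALWAYS_REPORT or user_in_sample(user_id, rate)
```

A stable hash such as SHA-256 is used here instead of Python's built-in `hash()`, which is randomized per process for strings and would put the same user in different buckets on different runs.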
Essential Dashboards
Dashboard 1: Global Overview
- DAU, total play sessions
- VST P50 / P95
- RBR, VSF, VPF
- SPI (composite score)
- Top 10 countries drill-down
Dashboard 2: CDN Health
- Per-CDN RBR, VST, error rate
- CDN comparison panel (same time window)
- CDN edge node map
Dashboard 3: Content Quality
- New title first-24-hour quality metrics
- Per-title completion rate and RBR
- Anomalous title alerts
Dashboard 4: Device and Version
- Error rate by app version
- VST by OS version
- RBR by device model
The QoE Optimization Loop
QoE data isn’t for passive observation — it drives engineering decisions:
Data reveals problem
│
▼
Locate root cause
(CDN? Encoding? ABR?)
│
▼
Try fix + A/B test
│
▼
Verify QoE improved
│
▼
Ship to 100% + keep monitoring
A weekly QoE review is standard practice on mature video teams.
Key Takeaways
- QoE measures user-perceived experience, not network metrics.
- Six core metrics: VST / RBR / VSF / EBVS / VPF / Average Bitrate.
- Conviva’s SPI is a composite “good experience session ratio.”
- Data must be sliced by multiple dimensions — a single global number can’t locate problems.
- Standard pipeline: Client SDK -> Kafka -> Flink/ClickHouse -> BI.
- Start with Mux/Conviva in early stages; build in-house when scale justifies it.
- QoE data drives decisions — review weekly.