Big Data on AWS Deep Dive (Part 10): Full Architecture Blueprint and Cost Breakdown
The complete end-to-end architecture for a social app's data warehouse and recommendation system on AWS — every service mapped, with real monthly cost estimates and optimization strategies.
This chapter assembles everything from the previous 9 chapters into a single picture. For the customer’s social app scenario, we present:
- A complete end-to-end architecture diagram (one view of the entire system)
- Monthly cost estimates for each component
- Phased implementation recommendations
- Key operational and governance considerations
Full End-to-End Architecture
Organized by layer from top to bottom:
| Layer | Components |
|---|---|
| Data Sources | Aurora MySQL / DocumentDB / OpenSearch / Client SDK / User Requests |
| Ingestion | Aurora Zero-ETL / DMS / OSI / API GW + Lambda + MSK + Firehose |
| Data Lake | S3 + Iceberg ODS Layer |
| Layered Processing | Glue (DWD) + Athena (DWS) + EMR Serverless (ADS), orchestrated by MWAA |
| ML Training | SageMaker Training + Processing + Feature Store (optional) |
| Online Serving | DynamoDB + ElastiCache + OpenSearch k-NN + Neptune (optional) + SM Endpoint |
| Real-Time Pipeline | Managed Flink + Lambda |
| Metadata / Governance | Glue Data Catalog + Lake Formation + CloudWatch + Schema Registry |
Data Flow Overview (Chronological)
Offline Batch Processing (Nightly)
02:00 Aurora Zero-ETL (continuous — data already in lake by early morning)
02:00 DMS DocumentDB CDC (continuous)
02:00 Firehose (continuous)
02:30 Glue Spark: dwd_user_action / dwd_post / dwd_user_relation
03:00 Athena CTAS: dws_user_daily / dws_post_daily / dws_pair_interaction
03:30 EMR Serverless: ads_user_features / ads_post_features
EMR Serverless: ads_sample_follow / ads_sample_ctr
EMR Serverless: ads_recall_u2u_cf
04:30 Glue Job: sync_user_features → DynamoDB
Glue Job: sync_recall_pool → DynamoDB
05:00 SageMaker Training: train_recall_two_tower
SageMaker Training: train_rank_lgb / deepfm
05:30 SageMaker Processing: compute_item_embeddings → OpenSearch
06:00 Lambda: deploy SageMaker Endpoints (Canary 10%)
06:30 Slack notification: DAG complete / DQ report
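The nightly schedule above implies a dependency graph that the MWAA DAG has to encode. The following is a minimal sketch of that graph using only the Python standard library. The task names are shortened from the schedule, and the dependency edges are an assumption inferred from the timeline order, not the actual DAG definition.

```python
from graphlib import TopologicalSorter

# Task -> set of upstream tasks that must finish first.
# Edges are inferred from the timeline order above (an assumption).
NIGHTLY_DEPS = {
    "glue_dwd": set(),
    "athena_dws": {"glue_dwd"},
    "emr_ads_features": {"athena_dws"},
    "emr_ads_samples": {"athena_dws"},
    "emr_ads_recall_u2u_cf": {"athena_dws"},
    "sync_user_features": {"emr_ads_features"},
    "sync_recall_pool": {"emr_ads_recall_u2u_cf"},
    "train_recall_two_tower": {"emr_ads_samples"},
    "train_rank": {"emr_ads_samples", "emr_ads_features"},
    "compute_item_embeddings": {"train_recall_two_tower"},
    "deploy_endpoints_canary": {"train_rank", "compute_item_embeddings"},
    "notify_slack": {
        "deploy_endpoints_canary",
        "sync_user_features",
        "sync_recall_pool",
    },
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(NIGHTLY_DEPS)
print(" -> ".join(order))
```

Expressing dependencies rather than fixed clock times lets a late upstream step delay its downstream steps instead of letting them run on stale data.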
Real-Time Stream (Continuous)
User click → SDK → API GW → Lambda → MSK
├─→ Firehose (60s buffer) → S3
├─→ Flink (sub-second) → DynamoDB user_realtime
└─→ Lambda fraud/risk detection
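To get a feel for the volume on this stream, here is a rough sizing sketch. It assumes the 100M daily events used in this chapter's cost estimates and a perfectly even arrival rate, which real traffic will not have, so treat the result as an average rather than a peak.

```python
DAILY_EVENTS = 100_000_000  # assumption taken from the cost estimates
BUFFER_SECONDS = 60         # Firehose buffer interval shown above
SECONDS_PER_DAY = 86_400

avg_events_per_second = DAILY_EVENTS / SECONDS_PER_DAY
avg_events_per_flush = avg_events_per_second * BUFFER_SECONDS

print(f"average rate: {avg_events_per_second:,.0f} events/s")
print(f"average events per {BUFFER_SECONDS}s flush: {avg_events_per_flush:,.0f}")
```

Peak-hour traffic can be several times the average, so MSK partitions and Flink parallelism should be sized against measured peaks, not this figure.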
Online Inference Flow (Every Feed Refresh)
User request GET /feed (200ms budget)
│
├─[5ms]──── DynamoDB user_features
├─[5ms]──── DynamoDB user_realtime
├─[1ms]──── Redis last_recommend_cache
├─[10ms]──── User Tower Endpoint → user embedding
├─[20ms]──── OpenSearch k-NN → 1000 candidates
├─[5ms]──── DynamoDB recall_u2u_cf → 200 candidates
├─[10ms]──── batch fetch 1200 item features (DynamoDB)
├─[30ms]──── Rank Endpoint scoring
├─[5ms]───── re-ranking (diversity)
└─return Top 10
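A quick way to sanity-check the flow above against the 200 ms budget is to add up the stage latencies. The sketch below copies the per-stage figures from the diagram and assumes every stage runs strictly one after another, which is the pessimistic case since the independent lookups could be issued in parallel.

```python
BUDGET_MS = 200

# (stage, latency in ms), copied from the flow above
STAGES = [
    ("DynamoDB user_features", 5),
    ("DynamoDB user_realtime", 5),
    ("Redis last_recommend_cache", 1),
    ("User Tower Endpoint", 10),
    ("OpenSearch k-NN recall", 20),
    ("DynamoDB recall_u2u_cf", 5),
    ("batch fetch item features", 10),
    ("Rank Endpoint scoring", 30),
    ("re-ranking (diversity)", 5),
]


def sequential_total(stages):
    """Total latency if every stage runs one after another."""
    return sum(ms for _, ms in stages)


total_ms = sequential_total(STAGES)
headroom_ms = BUDGET_MS - total_ms
print(f"sequential total: {total_ms} ms, headroom: {headroom_ms} ms")
```

The remaining headroom is what absorbs network hops, serialization, and tail latency, so the per-stage numbers should be read as typical values rather than p99 guarantees.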
Monthly Cost Estimates
Warning: Estimates are based on us-east-1 list prices. Assumptions: 1M DAU, 100M daily events, 30 TB active data. Actual pricing depends on current AWS rates.
Data Storage + Ingestion
| Service | Usage | Monthly Cost |
|---|---|---|
| S3 Standard + IT | 30 TB active + 100 TB cold archive | ~$700+ |
| Aurora MySQL | r6g.xlarge x 2 + storage | ~$700 |
| Aurora Zero-ETL to Lakehouse | Per change volume | ~$200 |
| DocumentDB | r6g.large x 3 | ~$1,500 |
| DMS | dms.t3.medium x 1 | ~$50 |
| OpenSearch (business search) | r6g.large x 3 | ~$700 |
| OpenSearch Ingestion | 1 OCU | ~$170 |
Event Tracking Pipeline
| Service | Usage | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 100M requests/day = 3B/month | ~$3,000 |
| Lambda (enrichment) | 3B invocations, 128MB x 50ms | ~$300 |
| MSK (Provisioned) | m7g.large x 3 + 5 TB storage | ~$500 |
| Firehose | 3B records + Parquet conversion | ~$300 |
Layered Processing + Orchestration
| Service | Usage | Monthly Cost |
|---|---|---|
| Glue ETL | 50 DPU-hours/day | ~$650 |
| Athena | 100 TB scanned/month (incl. ad-hoc) | ~$500 |
| EMR Serverless | 100 vCPU-hours/day for ADS processing | ~$300 |
| MWAA | mw1.small | ~$400 |
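The Glue and Athena rows are simple usage-times-rate products. The sketch below reproduces them under assumed list rates of $0.44 per DPU-hour for Glue and $5 per TB scanned for Athena; both rates are assumptions and should be checked against the current price list for the chosen Region.

```python
GLUE_USD_PER_DPU_HOUR = 0.44  # assumed list rate
ATHENA_USD_PER_TB = 5.00      # assumed list rate
DAYS_PER_MONTH = 30

glue_monthly = 50 * DAYS_PER_MONTH * GLUE_USD_PER_DPU_HOUR  # 50 DPU-hours/day
athena_monthly = 100 * ATHENA_USD_PER_TB                    # 100 TB scanned/month

print(f"Glue ETL: ${glue_monthly:,.0f}/mo")
print(f"Athena:   ${athena_monthly:,.0f}/mo")
```

Because Athena bills by bytes scanned, partition pruning and columnar formats reduce this line directly, while the Glue line scales with job runtime.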
ML Training
| Service | Usage | Monthly Cost |
|---|---|---|
| SageMaker Training | g5.xlarge x 8h x 1/day x 30 days (two-tower + LightGBM ranking) | ~$340 |
| SageMaker Processing | m5.2xlarge x 2h x 1/day x 30 days (item embedding pre-computation) | ~$25 |
| SageMaker Endpoint (User Tower) | ml.c5.xlarge x 2, 24x7 | ~$200 |
| SageMaker Endpoint (Rank) | ml.c5.2xlarge x 4, 24x7 | ~$800 |
Online Serving Layer (Recommendation Inference Read Path)
| Service | Usage | Monthly Cost |
|---|---|---|
| DynamoDB | 50K WCU + 200K RCU + 500GB | ~$1,500 |
| ElastiCache Redis | r7g.large x 2 | ~$300 |
| OpenSearch k-NN | r6g.xlarge x 3 + 500GB | ~$1,200 |
| Neptune (Phase 3) | r6g.large x 2 | ~$700 |
Real-Time Pipeline
| Service | Usage | Monthly Cost |
|---|---|---|
| Managed Flink | 4 KPU | ~$330 |
Governance / Monitoring
| Service | Monthly Cost |
|---|---|
| Glue Catalog | ~$10 |
| Lake Formation | $0 (no additional charge) |
| CloudWatch | ~$200 |
| Data transfer | ~$200 |
Summary (Split by Phase, Matching Customer T+1 Requirements vs Full Rollout)
| Module | Phase 1 (T+1 Data Foundation) | Phase 2 (+ML v1) | Phase 3+ (Real-Time + Full) |
|---|---|---|---|
| Data Storage + Ingestion | ~$3,200 | ~$4,200 | ~$5,400 |
| Event Tracking Pipeline | ~$3,400 (no MSK, Firehose only) | ~$3,400 | ~$4,100 (+MSK) |
| Layered Processing + Orchestration | ~$1,500 | ~$1,850 | ~$1,850 |
| ML Training | $0 | ~$700 | ~$1,400 |
| Online Serving | $0 | ~$2,000 (DDB+Redis+OS kNN) | ~$3,700 (+Neptune optional) |
| Real-Time Pipeline (Flink) | $0 | $0 | ~$330 |
| Cross-AZ Traffic + Governance + Misc | ~$300 | ~$400 | ~$600 |
| Total (list price) | ~$8,400/mo | ~$12,550/mo | ~$17,400/mo (incl. optional Neptune ~$700) |
Warning: MSK cross-AZ replication traffic, CloudWatch log ingestion, and NAT Gateway traffic can add up to hundreds of dollars per month combined — these are commonly overlooked in estimates.
These are list price estimates. With EDP/PPA discounts, Reserved Instances, and Savings Plans, actual costs typically drop 20-40%.
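The totals and the discount range can be recomputed from the module rows. The sketch below copies the per-module figures from the summary table and applies the 20-40% discount range quoted above; the dictionary layout and function name are illustrative only.

```python
PHASES = ("Phase 1", "Phase 2", "Phase 3+")

# Monthly list-price estimates in USD, copied from the summary table
MODULE_COSTS = {
    "Data Storage + Ingestion": (3200, 4200, 5400),
    "Event Tracking Pipeline": (3400, 3400, 4100),
    "Layered Processing + Orchestration": (1500, 1850, 1850),
    "ML Training": (0, 700, 1400),
    "Online Serving": (0, 2000, 3700),
    "Real-Time Pipeline (Flink)": (0, 0, 330),
    "Cross-AZ Traffic + Governance + Misc": (300, 400, 600),
}


def phase_totals(costs):
    """Sum each phase column across all modules."""
    return [sum(column) for column in zip(*costs.values())]


for name, total in zip(PHASES, phase_totals(MODULE_COSTS)):
    low, high = total * 0.6, total * 0.8  # after a 40% and a 20% discount
    print(f"{name}: list ${total:,}, discounted ${low:,.0f} to ${high:,.0f}")
```

The Phase 3+ column sums to $17,380, which the table rounds to ~$17,400.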
Cost Distribution (Mental Model for Phase 3 Full List Price)
Ingestion + Storage ████████ 31%
Event Tracking ███████ 24%
Online Serving ██████ 21%
Compute / Processing ███ 11%
ML Training ██ 8%
Other (incl. cross-AZ) ██ 5%
The three areas most worth optimizing:
- API Gateway call volume: have the SDK batch several events into each upload request; since billing is per request, this can reduce costs by 50%+
- DynamoDB: switch from On-Demand to Provisioned + Auto Scaling for 40%+ savings; eventually consistent reads consume half the read capacity of strongly consistent reads, saving another 50% on reads
- Cross-AZ traffic / NAT Gateway: use VPC Endpoints for S3 / DDB / Athena so traffic stays on the private network and bypasses NAT Gateway data-processing charges; often saves hundreds to thousands of dollars
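The batching claim in the first bullet can be checked with a quick calculation. The sketch below assumes HTTP API tiered pricing of $1.00 per million requests for the first 300 million and $0.90 per million after that; these rates are assumptions to verify against the current price list, and the function name is illustrative.

```python
def http_api_monthly_cost(requests, tier1_limit=300_000_000,
                          tier1_rate=1.00, tier2_rate=0.90):
    """Tiered request cost in USD; rates are per million requests (assumed)."""
    tier1 = min(requests, tier1_limit)
    tier2 = max(requests - tier1_limit, 0)
    return (tier1 * tier1_rate + tier2 * tier2_rate) / 1_000_000


MONTHLY_EVENTS = 3_000_000_000  # 100M events/day x 30 days

baseline = http_api_monthly_cost(MONTHLY_EVENTS)
for events_per_request in (1, 2, 5, 10):
    requests = MONTHLY_EVENTS // events_per_request
    cost = http_api_monthly_cost(requests)
    saving = 1 - cost / baseline
    print(f"{events_per_request:>2} events/request: "
          f"${cost:,.0f}/mo ({saving:.0%} saved)")
```

Batching also cuts Lambda invocations by the same factor. The calculation assumes each batched payload stays small enough to be metered as a single request.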
Phased Implementation Recommendations
Phase 1: Data Foundation (1-1.5 months)
Goal: Get event tracking, business databases, and the ODS layer running end-to-end; Athena can query all data sources.
- Aurora Zero-ETL (path 1) / DMS (path 3) dual-link POC
- DocumentDB Change Streams + DMS enabled
- OpenSearch Ingestion pipeline: ES to S3
- API GW + Lambda + Firehose event tracking pipeline (MSK is added in Phase 3 with the real-time pipeline, matching the cost summary)
- Glue Catalog registers all ODS tables
- DWD v1 (Glue Job: cleansing + IP-to-geo)
- MWAA running baseline DAG
Deliverable: T+1 data fully landed in the lake; Athena can query all ODS/DWD tables.
Phase 2: First Model Version (1-2 months)
Goal: End-to-end recommendation pipeline is functional, Top 10 recommendations have a baseline.
- DWS / ADS layer SQL orchestration
- Follow-sample table + CTR-sample table (PIT-correct)
- User feature / content feature wide tables
- LightGBM ranking model + two-tower recall model training
- OpenSearch k-NN item vector index
- DynamoDB sync for user features + recall pool
- SageMaker Endpoint deployment: User Tower + Rank Model
- Recommendation service (ECS) end-to-end integration
- A/B testing platform integration
Deliverable: Core metrics (CTR / follow conversion rate / retention) have a baseline.
Phase 3: Advanced Capabilities (2-3 months)
Goal: Multi-channel recall, real-time features, model upgrades.
- Managed Flink real-time feature pipeline
- DynamoDB user_realtime goes live
- Ranking model upgrade to DeepFM / DIN
- Multi-channel recall (graph recall / interest tags / trending)
- Introduce Neptune + Neptune ML
- SageMaker Feature Store full migration (if cost-justified)
Phase 4: Continuous Optimization (Ongoing)
- Cold-start optimization (new users / new content)
- Multi-task models (MMoE / PLE)
- Re-ranking (diversity / fairness)
- Automated model monitoring (Model Monitor)
- Cost optimization (RI / Savings Plans / caching)
Governance and Operations
Data Quality (DQ)
Run DQ checks after each DAG step:
| Check Type | Example |
|---|---|
| Row count | dwd_user_action today > 100M |
| Uniqueness | event_id has no duplicates |
| Null rate | user_id null < 0.1% |
| Range | age between 0 and 150 |
| Consistency | dws_user_daily total clicks = ods_event daily click count |
Tools: AWS Glue Data Quality (based on Deequ) / custom SQL checks.
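For the custom-check route, the first four check types in the table reduce to a few lines of logic. The sketch below runs them over in-memory rows purely for illustration; in the pipeline the same conditions would be expressed as SQL or Glue Data Quality rules. The function name, row layout, and sample data are hypothetical.

```python
def run_dq_checks(rows, min_rows):
    """Return {check_name: passed} for a list of dict rows."""
    event_ids = [r["event_id"] for r in rows]
    null_users = sum(1 for r in rows if r.get("user_id") is None)
    ages = [r["age"] for r in rows if r.get("age") is not None]
    return {
        "row_count": len(rows) > min_rows,
        "event_id_unique": len(set(event_ids)) == len(event_ids),
        # Null rate must stay below 0.1%; an empty table fails the check.
        "user_id_null_rate": bool(rows) and null_users / len(rows) < 0.001,
        "age_in_range": all(0 <= a <= 150 for a in ages),
    }


# Hypothetical sample: duplicate event_id, null user_id, out-of-range age
sample = [
    {"event_id": "e1", "user_id": "u1", "age": 25},
    {"event_id": "e2", "user_id": "u2", "age": 31},
    {"event_id": "e2", "user_id": None, "age": 200},
]

report = run_dq_checks(sample, min_rows=2)
failed = [name for name, ok in report.items() if not ok]
print("failed checks:", failed)
```

A DAG step would typically fail, or alert and continue, when the list of failed checks is non-empty, depending on how critical the table is.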
Cost Management
- CloudWatch Anomaly Detection: billing anomaly alerts
- Athena Workgroup Cost Limit: cap the data scanned per query and per workgroup so runaway queries cannot blow up costs
- DynamoDB Auto Scaling: automatically scales provisioned capacity down when traffic drops
- S3 Intelligent-Tiering: enable for all buckets
- Cost Allocation Tags: tag every resource with team=algo / team=da / team=infra for per-team billing attribution
Security
- Encryption at rest: S3 SSE-KMS / DynamoDB encryption / RDS encryption
- VPC Endpoints: Route S3 / DynamoDB / Athena via private network, never public internet
- IAM Role least privilege
- Lake Formation column-level permissions: mask phone numbers / national IDs
- GuardDuty + Security Hub: threat detection
Disaster Recovery
- S3 Cross-Region Replication (for active-active requirements)
- DynamoDB Global Tables (multi-Region real-time sync)
- Aurora Global Database (cross-Region DR)
Model Governance
- Model versioning (SageMaker Model Registry)
- Canary deployments (Production Variants with weighted traffic splitting)
- Model Monitor for drift detection
- Scheduled retraining (weekly / daily) + automated evaluation
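With Production Variants, each variant receives traffic in proportion to its weight divided by the sum of all weights. The sketch below builds a variant list in the shape accepted by the SageMaker `create_endpoint_config` call and computes the resulting split for a 10% canary. The variant names, model names, and canary instance count are hypothetical, and no AWS call is made.

```python
def traffic_shares(variants):
    """Share of requests per variant: its weight / sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {
        v["VariantName"]: v["InitialVariantWeight"] / total for v in variants
    }


# Hypothetical names; keys follow the ProductionVariants request structure
production_variants = [
    {
        "VariantName": "stable",
        "ModelName": "rank-model-v1",
        "InstanceType": "ml.c5.2xlarge",
        "InitialInstanceCount": 4,
        "InitialVariantWeight": 9.0,
    },
    {
        "VariantName": "canary",
        "ModelName": "rank-model-v2",
        "InstanceType": "ml.c5.2xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]

shares = traffic_shares(production_variants)
for name, share in shares.items():
    print(f"{name}: {share:.0%} of traffic")
```

Promoting the canary then amounts to shifting the weights step by step while watching the core metrics, rather than replacing the endpoint in one move.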
AWS Data/AI Stack Updates in 2026 (Quick Reference)
A comparison of “now vs one year ago” for this architecture, helping customers decide which new capabilities to adopt:
| Module | Early 2025 | May 2026 Status |
|---|---|---|
| Iceberg Managed | S3 Tables just GA | S3 Tables + auto compaction, snapshot expiration, replication, IT tiering |
| Zero-ETL | Aurora to Lakehouse just GA | + RDS MySQL / DynamoDB / Salesforce / SAP / ServiceNow / Zendesk to Lakehouse (many new integrations) |
| Glue | 4.0 (Spark 3.3) | Glue 5.0 (Spark 3.5 + Iceberg auto maintenance) |
| Athena | Engine v3 | + Federated Spark / Iceberg materialized views (some regions in preview) |
| OpenSearch k-NN | nmslib / Lucene default | Faiss recommended for production, nmslib deprecated |
| Vector Storage | OpenSearch k-NN | + S3 Vectors GA (ideal for RAG / cold vectors) |
| SageMaker Branding | 4 pillars just split | Unified Studio GA, Bedrock IDE integration, Q Developer embedded |
| GenAI | Bedrock foundation models | + Amazon Nova series (Pro/Premier/Canvas/Reel), Bedrock AgentCore, Knowledge Bases + S3 Vectors / Structured Data Retrieval |
| HyperPod | Already GA | + Task governance / flexible training plans / Inference Components |
| Q in QuickSight | Topics + Q&A | + Scenarios (what-if natural language), auto Topic generation |
Specific impact on the customer’s social app recommendation scenario:
- S3 Tables auto compaction — eliminates one maintenance job
- Glue 5.0 + Iceberg — DWD processing performance improvement
- Faiss engine — lower OpenSearch k-NN recall latency
- Bedrock + Knowledge Bases — ready for future “AI content moderation / comment summarization / intelligent customer service” without self-training an LLM
Open Questions for the Customer
Questions to resolve before completing the POC:
- DocumentDB version: 5.0 (confirmed) — Change Streams available
- Latency requirement: T+1 (confirmed) — Phase 1-2 does not need Flink
- ES ingestion: Required (confirmed) — Phase 1 runs OSI
- What does ES store? Does it overlap with MySQL?
- Recommendation scenarios: Follow recommendations / Feed / PYMK — which is the priority?
- Existing event tracking SDK / A/B platform / data team size?
- Region selection: us-east-1 / ap-northeast-1 / other?
- Compliance requirements: GDPR / China PIPL?
These questions determine specific tradeoffs and should be resolved during the Discovery phase.
Series Conclusion
Congratulations on making it here. You have now covered:
- Big data fundamentals (OLTP / OLAP / data lake / Lakehouse / CDC / layered architecture)
- S3 + Parquet + Iceberg storage foundation
- DMS / Zero-ETL / OSI / Firehose / MSK data ingestion
- Glue Catalog / Athena / Lake Formation metadata + querying
- Glue / EMR / Flink / Lambda / MWAA compute and orchestration
- Recommendation system funnel / two-tower / feature engineering / PIT
- DynamoDB / Redis / OpenSearch k-NN / Neptune online serving
- SageMaker full lifecycle
- End-to-end architecture + cost estimates
Further Resources
| Topic | Link |
|---|---|
| AWS Modern Data Architecture | https://docs.aws.amazon.com/whitepapers/latest/modern-data-architecture-rays-on-aws/modern-data-architecture-rays-on-aws.html |
| Apache Iceberg on AWS | https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html |
| Building Recommendation Systems on AWS | https://aws.amazon.com/solutions/implementations/personalized-recommendations/ |
| AWS Big Data Blog | https://aws.amazon.com/blogs/big-data/ |
| Iceberg Official | https://iceberg.apache.org/ |
| Feast (Open-Source Feature Store) | https://feast.dev/ |
| DGL (Graph Neural Network Library) | https://www.dgl.ai/ |