Big Data on AWS Deep Dive (Part 10): Full Architecture Blueprint and Cost Breakdown

The complete end-to-end architecture for a social app's data warehouse and recommendation system on AWS — every service mapped, with real monthly cost estimates and optimization strategies.

zhuermu · 12 min
big-data · aws · architecture · cost-optimization · recommendation-system · data-warehouse · production

This chapter assembles everything from the previous 9 chapters into a single picture. For the customer’s social app scenario, we present:

  1. A complete end-to-end architecture diagram (one view of the entire system)
  2. Monthly cost estimates for each component
  3. Phased implementation recommendations
  4. Key operational and governance considerations

Full End-to-End Architecture

[Figure: full architecture diagram]

Organized by layer from top to bottom:

| Layer | Components |
|---|---|
| Data Sources | Aurora MySQL / DocumentDB / OpenSearch / Client SDK / User Requests |
| Ingestion | Aurora Zero-ETL / DMS / OSI / API GW + Lambda + MSK + Firehose |
| Data Lake | S3 + Iceberg ODS Layer |
| Layered Processing | Glue (DWD) + Athena (DWS) + EMR Serverless (ADS), orchestrated by MWAA |
| ML Training | SageMaker Training + Processing + Feature Store (optional) |
| Online Serving | DynamoDB + ElastiCache + OpenSearch k-NN + Neptune (optional) + SM Endpoint |
| Real-Time Pipeline | Managed Flink + Lambda |
| Metadata / Governance | Glue Data Catalog + Lake Formation + CloudWatch + Schema Registry |

Data Flow Overview (Chronological)

Offline Batch Processing (Nightly)

02:00  Aurora Zero-ETL (continuous — data already in lake by early morning)
02:00  DMS DocumentDB CDC (continuous)
02:00  Firehose (continuous)
02:30  Glue Spark: dwd_user_action / dwd_post / dwd_user_relation
03:00  Athena CTAS: dws_user_daily / dws_post_daily / dws_pair_interaction
03:30  EMR Serverless: ads_user_features / ads_post_features
       EMR Serverless: ads_sample_follow / ads_sample_ctr
       EMR Serverless: ads_recall_u2u_cf
04:30  Glue Job: sync_user_features → DynamoDB
       Glue Job: sync_recall_pool → DynamoDB
05:00  SageMaker Training: train_recall_two_tower
       SageMaker Training: train_rank_lgb / deepfm
05:30  SageMaker Processing: compute_item_embeddings → OpenSearch
06:00  Lambda: deploy SageMaker Endpoints (Canary 10%)
06:30  Slack notification: DAG complete / DQ report
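
The timetable above is effectively a dependency graph that MWAA runs as a DAG. As a rough sketch of that ordering, the dependencies can be written down with Python's standard `graphlib`. The edges and the grouped task names below are assumptions read off the schedule (each step waits for the step before it), not the actual DAG definition.

```python
from graphlib import TopologicalSorter

# Task -> set of upstream tasks. Edges are assumed from the timetable order.
NIGHTLY_DAG = {
    "glue_dwd": set(),                           # 02:30 dwd_* tables
    "athena_dws": {"glue_dwd"},                  # 03:00 dws_* tables
    "emr_ads": {"athena_dws"},                   # 03:30 ads_* tables
    "sync_dynamodb": {"emr_ads"},                # 04:30 features + recall pool
    "sagemaker_training": {"emr_ads"},           # 05:00 two-tower + ranking
    "item_embeddings": {"sagemaker_training"},   # 05:30 -> OpenSearch
    "deploy_endpoints": {"sagemaker_training", "item_embeddings"},  # 06:00
    "notify": {"sync_dynamodb", "deploy_endpoints"},                # 06:30
}

def execution_order(dag):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(NIGHTLY_DAG)
print(order)
```

In this sketch `sync_dynamodb` and `sagemaker_training` share only the `emr_ads` dependency, so an orchestrator could run them side by side rather than strictly by clock time.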

Real-Time Stream (Continuous)

User click → SDK → API GW → Lambda → MSK
                                   ├─→ Firehose (60s buffer) → S3
                                   ├─→ Flink (sub-second) → DynamoDB user_realtime
                                   └─→ Lambda fraud/risk detection
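
To make the Flink branch more concrete, here is a plain-Python illustration of the kind of sliding-window aggregation that could feed `user_realtime`. It is not Flink code: the class name, the 300-second window, and the assumption that events arrive in timestamp order are all hypothetical, and a real job would use Flink's windowing and state APIs.

```python
from collections import defaultdict, deque

class RollingClickCounter:
    """Counts each user's clicks within the last `window_s` seconds."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self._clicks = defaultdict(deque)  # user_id -> click timestamps (ascending)

    def add(self, user_id, ts):
        """Record a click at epoch-second `ts` and return the user's current count."""
        q = self._clicks[user_id]
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q)

counter = RollingClickCounter(window_s=300)
counter.add("u1", 1000)
counter.add("u1", 1100)
latest = counter.add("u1", 1350)  # the click at t=1000 has expired by now
print(latest)
```

The returned count is the value that would be written to the user's row after each event.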

Online Inference Flow (Every Feed Refresh)

User request GET /feed (200ms budget)

  ├─[5ms]──── DynamoDB user_features
  ├─[5ms]──── DynamoDB user_realtime
  ├─[1ms]──── Redis last_recommend_cache
  ├─[10ms]──── User Tower Endpoint → user embedding
  ├─[20ms]──── OpenSearch k-NN → 1000 candidates
  ├─[5ms]──── DynamoDB recall_u2u_cf → 200 candidates
  ├─[10ms]──── batch fetch 1200 item features (DynamoDB)
  ├─[30ms]──── Rank Endpoint scoring
  ├─[5ms]───── re-ranking (diversity)
  └─return Top 10
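
A quick way to sanity-check the 200ms budget is to add up the stage latencies from the flow above. The sketch below does that, and also estimates the critical path if some lookups run concurrently. The parallel grouping is an assumption for illustration; the flow itself does not say which calls overlap.

```python
# Stage latencies in ms, copied from the flow above.
STAGES = {
    "ddb_user_features": 5,
    "ddb_user_realtime": 5,
    "redis_last_recommend_cache": 1,
    "user_tower_endpoint": 10,
    "opensearch_knn": 20,
    "ddb_recall_u2u_cf": 5,
    "batch_fetch_item_features": 10,
    "rank_endpoint": 30,
    "rerank_diversity": 5,
}
BUDGET_MS = 200

def serial_latency(stages):
    """Total latency if every stage runs one after another."""
    return sum(stages.values())

def critical_path(stages, groups):
    """Each group runs in parallel (cost = slowest member); groups run in sequence."""
    return sum(max(stages[name] for name in group) for group in groups)

# Assumed grouping: user lookups in parallel, then the two recall channels in parallel.
PARALLEL_GROUPS = [
    ["ddb_user_features", "ddb_user_realtime", "redis_last_recommend_cache"],
    ["user_tower_endpoint"],
    ["opensearch_knn", "ddb_recall_u2u_cf"],
    ["batch_fetch_item_features"],
    ["rank_endpoint"],
    ["rerank_diversity"],
]

serial_ms = serial_latency(STAGES)
parallel_ms = critical_path(STAGES, PARALLEL_GROUPS)
headroom_ms = BUDGET_MS - serial_ms
print(serial_ms, parallel_ms, headroom_ms)
```

Even fully serial, the listed stages leave more than half the budget for network hops, serialization, and tail latency.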

Monthly Cost Estimates

Warning: Estimates are based on us-east-1 list prices. Assumptions: 1M DAU, 100M daily events, 30 TB active data. Actual pricing depends on current AWS rates.

Data Storage + Ingestion

| Service | Usage | Monthly Cost |
|---|---|---|
| S3 Standard + IT | 30 TB active + 100 TB cold archive | ~$700 + $400 = **$1,100** |
| Aurora MySQL | r6g.xlarge x 2 + storage | ~$700 |
| Aurora Zero-ETL to Lakehouse | Per change volume | ~$200 |
| DocumentDB | r6g.large x 3 | ~$1,500 |
| DMS | dms.t3.medium x 1 | ~$50 |
| OpenSearch (business search) | r6g.large x 3 | ~$700 |
| OpenSearch Ingestion | 1 OCU | ~$170 |

Event Tracking Pipeline

| Service | Usage | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 100M requests/day = 3B/month | ~$3,000 |
| Lambda (enrichment) | 3B invocations, 128MB x 50ms | ~$300 |
| MSK (Provisioned) | m7g.large x 3 + 5 TB storage | ~$500 |
| Firehose | 3B records + Parquet conversion | ~$300 |

Layered Processing + Orchestration

| Service | Usage | Monthly Cost |
|---|---|---|
| Glue ETL | 50 DPU-hours/day | ~$650 |
| Athena | 100 TB scanned/month (incl. ad-hoc) | ~$500 |
| EMR Serverless | 100 vCPU-hours/day for ADS processing | ~$300 |
| MWAA | mw1.small | ~$400 |
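
The Glue and Athena rows follow directly from usage times unit price. The sketch below reproduces that arithmetic. The unit prices are assumed us-east-1 list prices ($0.44 per DPU-hour, $5 per TB scanned) and should be checked against the current pricing pages; the results land close to the rounded figures in the table.

```python
# Assumed us-east-1 list prices; verify against the current AWS pricing pages.
GLUE_USD_PER_DPU_HOUR = 0.44
ATHENA_USD_PER_TB_SCANNED = 5.00
DAYS_PER_MONTH = 30

def glue_monthly_cost(dpu_hours_per_day):
    """Monthly Glue ETL cost for a steady daily DPU-hour usage."""
    return dpu_hours_per_day * DAYS_PER_MONTH * GLUE_USD_PER_DPU_HOUR

def athena_monthly_cost(tb_scanned_per_month):
    """Monthly Athena cost, billed by data scanned."""
    return tb_scanned_per_month * ATHENA_USD_PER_TB_SCANNED

glue_usd = glue_monthly_cost(50)
athena_usd = athena_monthly_cost(100)
print(round(glue_usd), round(athena_usd))
```

Because Athena bills by bytes scanned, partition pruning and columnar formats such as Parquet reduce this line item directly.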

ML Training

| Service | Usage | Monthly Cost |
|---|---|---|
| SageMaker Training | g5.xlarge x 8h x 1/day x 30 days (two-tower + LightGBM ranking) | ~$340 |
| SageMaker Processing | m5.2xlarge x 2h x 1/day x 30 days (item embedding pre-computation) | ~$25 |
| SageMaker Endpoint (User Tower) | ml.c5.xlarge x 2, 24x7 | ~$200 |
| SageMaker Endpoint (Rank) | ml.c5.2xlarge x 4, 24x7 | ~$800 |

Online Serving Layer (Recommendation Inference Read Path)

| Service | Usage | Monthly Cost |
|---|---|---|
| DynamoDB | 50K WCU + 200K RCU + 500GB | ~$1,500 |
| ElastiCache Redis | r7g.large x 2 | ~$300 |
| OpenSearch k-NN | r6g.xlarge x 3 + 500GB | ~$1,200 |
| Neptune (Phase 3) | r6g.large x 2 | ~$700 |

Real-Time Pipeline

| Service | Usage | Monthly Cost |
|---|---|---|
| Managed Flink | 4 KPU | ~$330 |

Governance / Monitoring

| Service | Monthly Cost |
|---|---|
| Glue Catalog | ~$10 |
| Lake Formation | $0 (no additional charge) |
| CloudWatch | ~$200 |
| Data transfer | ~$200 |

Summary (Split by Phase, Matching Customer T+1 Requirements vs Full Rollout)

| Module | Phase 1 (T+1 Data Foundation) | Phase 2 (+ML v1) | Phase 3+ (Real-Time + Full) |
|---|---|---|---|
| Data Storage + Ingestion | ~$3,200 | ~$4,200 | ~$5,400 |
| Event Tracking Pipeline | ~$3,400 (no MSK, Firehose only) | ~$3,400 | ~$4,100 (+MSK) |
| Layered Processing + Orchestration | ~$1,500 | ~$1,850 | ~$1,850 |
| ML Training | $0 | ~$700 | ~$1,400 |
| Online Serving | $0 | ~$2,000 (DDB+Redis+OS kNN) | ~$3,700 (incl. optional Neptune ~$700) |
| Real-Time Pipeline (Flink) | $0 | $0 | ~$330 |
| Cross-AZ Traffic + Governance + Misc | ~$300 | ~$400 | ~$600 |
| Total (list price) | ~$8,400/mo | ~$12,550/mo | ~$17,400/mo (incl. optional Neptune ~$700) |

Warning: MSK cross-AZ replication traffic, CloudWatch log ingestion, and NAT Gateway traffic can add up to hundreds of dollars per month combined — these are commonly overlooked in estimates.

These are list price estimates. With EDP/PPA discounts, Reserved Instances, and Savings Plans, actual costs typically drop 20-40%.
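
To see how the phase totals and the discount range combine, the sketch below sums the per-module figures from the summary table and applies the 20-40% range. The discount is treated as a flat percentage on the whole bill, which is a simplification: in practice Reserved Instances and Savings Plans only apply to some services.

```python
# Monthly list-price estimates (USD) per module, copied from the summary table.
PHASES = {
    "Phase 1": [3200, 3400, 1500, 0, 0, 0, 300],
    "Phase 2": [4200, 3400, 1850, 700, 2000, 0, 400],
    "Phase 3+": [5400, 4100, 1850, 1400, 3700, 330, 600],
}

def discounted_range(list_price, low=0.20, high=0.40):
    """Return (best_case, worst_case) monthly cost after a 20-40% discount."""
    return list_price * (1 - high), list_price * (1 - low)

totals = {phase: sum(costs) for phase, costs in PHASES.items()}
for phase, total in totals.items():
    best, worst = discounted_range(total)
    print(f"{phase}: list ${total:,} -> ${best:,.0f} to ${worst:,.0f}")
```

The Phase 3+ sum comes to $17,380, which the table rounds to ~$17,400.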

Cost Distribution (Mental Model for Phase 3 Full List Price)

Ingestion + Storage     ████████ 31%
Event Tracking          ███████  24%
Online Serving          ██████   21%
Compute / Processing    ███      11%
ML Training             ██        8%
Other (incl. cross-AZ)  ██        5%

The three areas most worth optimizing:

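  1. API Gateway call volume: have the SDK batch-merge event uploads so that one request carries many events; this can reduce API Gateway and Lambda invocation costs by 50%+
  2. DynamoDB: for steady traffic, switch from On-Demand to Provisioned + Auto Scaling for 40%+ savings; eventually consistent reads cost half as much as strongly consistent reads
  3. Cross-AZ traffic / NAT Gateway: use VPC Endpoints for S3 / DDB / Athena to route via the private network (S3 and DynamoDB use gateway endpoints with no hourly charge; Athena needs an interface endpoint); often saves hundreds to thousands of dollars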
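
As a rough illustration of the first item, the sketch below pairs a minimal client-side batcher with a tiered cost function for API Gateway HTTP APIs. The `EventBatcher` class, the batch size of 10, and the `send` callback are hypothetical. The price tiers ($1.00 per million requests for the first 300M per month, $0.90 per million after) are assumed us-east-1 list prices, and HTTP API requests are metered in 512 KB increments, so small batches still count as one request.

```python
# Assumed us-east-1 HTTP API list price: $1.00 per million requests for the
# first 300M each month, $0.90 per million after that.
TIER_1_LIMIT = 300_000_000
TIER_1_USD_PER_MILLION = 1.00
TIER_2_USD_PER_MILLION = 0.90

def http_api_monthly_cost(requests):
    """Tiered monthly request charge for an HTTP API."""
    tier_1 = min(requests, TIER_1_LIMIT)
    tier_2 = max(requests - TIER_1_LIMIT, 0)
    return (tier_1 * TIER_1_USD_PER_MILLION + tier_2 * TIER_2_USD_PER_MILLION) / 1_000_000

class EventBatcher:
    """Client-side buffer that sends one request per `batch_size` events."""

    def __init__(self, send, batch_size=10):
        self.send = send            # callable that uploads a list of events
        self.batch_size = batch_size
        self._buffer = []

    def track(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self.send(self._buffer)
            self._buffer = []

requests_sent = []
batcher = EventBatcher(send=requests_sent.append, batch_size=10)
for i in range(25):
    batcher.track({"event_id": i})
batcher.flush()  # flush the partial batch, e.g. when the app goes to background

events_per_month = 3_000_000_000
unbatched = http_api_monthly_cost(events_per_month)
batched = http_api_monthly_cost(events_per_month // 10)
print(len(requests_sent), round(unbatched), round(batched))
```

A production SDK would also flush on a timer so that events from low-activity users are not held back indefinitely.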

Phased Implementation Recommendations

Phase 1: Data Foundation (1-1.5 months)

Goal: Get event tracking, business databases, and the ODS layer running end-to-end; Athena can query all data sources.

  • Aurora Zero-ETL (path 1) / DMS (path 3) dual-link POC
  • DocumentDB Change Streams + DMS enabled
  • OpenSearch Ingestion pipeline: ES to S3
  • API GW + Lambda + Firehose event tracking pipeline (the cost summary assumes MSK is added in Phase 3)
  • Glue Catalog registers all ODS tables
  • DWD v1 (Glue Job: cleansing + IP-to-geo)
  • MWAA running baseline DAG

Deliverable: T+1 data fully landed in the lake; Athena can query all ODS/DWD tables.

Phase 2: First Model Version (1-2 months)

Goal: End-to-end recommendation pipeline is functional, Top 10 recommendations have a baseline.

  • DWS / ADS layer SQL orchestration
  • Follow-sample table + CTR-sample table (PIT-correct)
  • User feature / content feature wide tables
  • LightGBM ranking model + two-tower recall model training
  • OpenSearch k-NN item vector index
  • DynamoDB sync for user features + recall pool
  • SageMaker Endpoint deployment: User Tower + Rank Model
  • Recommendation service (ECS) end-to-end integration
  • A/B testing platform integration
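
The "PIT-correct" requirement on the sample tables in the list above means each training label may only see feature values that existed at the time of the labeled event. Below is a minimal sketch of such a point-in-time lookup. The snapshot layout, the integer `yyyymmdd` timestamps, and the `followers` feature are hypothetical; in this pipeline the join would run as SQL or Spark over the Iceberg tables.

```python
from bisect import bisect_right

def pit_lookup(snapshots, label_ts):
    """Return the latest feature snapshot taken at or before `label_ts`.

    `snapshots` is a list of (ts, features) sorted by ts ascending.
    Returns None when no snapshot existed yet, which avoids leaking the future.
    """
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, label_ts)
    return snapshots[idx - 1][1] if idx else None

# Hypothetical daily snapshots of one user's follower count.
user_snapshots = [
    (20240101, {"followers": 10}),
    (20240102, {"followers": 12}),
    (20240104, {"followers": 30}),
]

sample = pit_lookup(user_snapshots, 20240103)    # label event on Jan 3
too_early = pit_lookup(user_snapshots, 20231231)  # before any snapshot exists
print(sample, too_early)
```

With T+1 daily snapshots it is often safer to require a snapshot strictly before the label's day, since a same-day snapshot may already include the labeled action.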

Deliverable: Core metrics (CTR / follow conversion rate / retention) have a baseline.

Phase 3: Advanced Capabilities (2-3 months)

Goal: Multi-channel recall, real-time features, model upgrades.

  • Managed Flink real-time feature pipeline
  • DynamoDB user_realtime goes live
  • Ranking model upgrade to DeepFM / DIN
  • Multi-channel recall (graph recall / interest tags / trending)
  • Introduce Neptune + Neptune ML
  • SageMaker Feature Store full migration (if cost-justified)

Phase 4: Continuous Optimization (Ongoing)

  • Cold-start optimization (new users / new content)
  • Multi-task models (MMoE / PLE)
  • Re-ranking (diversity / fairness)
  • Automated model monitoring (Model Monitor)
  • Cost optimization (RI / Savings Plans / caching)

Governance and Operations

Data Quality (DQ)

Run DQ checks after each DAG step:

| Check Type | Example |
|---|---|
| Row count | dwd_user_action today > 100M |
| Uniqueness | event_id has no duplicates |
| Null rate | user_id null < 0.1% |
| Range | age between 0 and 150 |
| Consistency | dws_user_daily total clicks = ods_event daily click count |

Tools: AWS Glue Data Quality (based on Deequ) / custom SQL checks.
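
For a sense of what a custom check computes, here is a small pure-Python sketch covering four of the check types above on in-memory rows. The function name, row layout, and thresholds are assumptions for illustration. At production scale the same logic would be expressed as Glue Data Quality rules or SQL against the lake tables.

```python
def run_dq_checks(rows, min_rows, max_null_rate=0.001):
    """Return {check_name: passed} for a list of event dicts."""
    total = len(rows)
    event_ids = [r["event_id"] for r in rows]
    null_users = sum(1 for r in rows if r.get("user_id") is None)
    ages = [r["age"] for r in rows if r.get("age") is not None]
    return {
        "row_count": total > min_rows,
        "unique_event_id": len(set(event_ids)) == total,
        "user_id_null_rate": total > 0 and null_users / total < max_null_rate,
        "age_range": all(0 <= a <= 150 for a in ages),
    }

rows = [
    {"event_id": "e1", "user_id": "u1", "age": 25},
    {"event_id": "e2", "user_id": "u2", "age": 31},
    {"event_id": "e2", "user_id": None, "age": 200},  # duplicate id, null user, bad age
]
report = run_dq_checks(rows, min_rows=2)
failed = sorted(name for name, ok in report.items() if not ok)
print(failed)
```

A DAG step can fail fast when `failed` is non-empty, so that bad data does not flow into the downstream layers.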

Cost Management

  • AWS Cost Anomaly Detection (or a CloudWatch billing alarm): billing anomaly alerts
  • Athena workgroup data usage controls (per-query and per-workgroup scan limits): prevent runaway queries from blowing up costs
  • DynamoDB Auto Scaling — automatically scales down when traffic drops
  • S3 Intelligent-Tiering — enable for all buckets
  • Cost Allocation Tags — tag every resource with team=algo / team=da / team=infra for per-team billing attribution
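
As one example from the list above, the Athena guard can be expressed as a workgroup setting that caps bytes scanned per query. The sketch below only builds the request arguments for boto3's `create_work_group` and does not call AWS. The workgroup name and the 1 TB cap are hypothetical, and the $5 per TB price and the binary TB conversion are assumptions.

```python
ATHENA_USD_PER_TB_SCANNED = 5.00  # assumed us-east-1 list price
BYTES_PER_TB = 1024 ** 4          # assumption: binary terabyte

def workgroup_request(name, max_tb_per_query):
    """Build keyword arguments for boto3's athena.create_work_group call."""
    return {
        "Name": name,
        "Configuration": {
            # Queries are cancelled once they scan more than this many bytes.
            "BytesScannedCutoffPerQuery": int(max_tb_per_query * BYTES_PER_TB),
            # Stop clients from overriding the workgroup settings.
            "EnforceWorkGroupConfiguration": True,
            "PublishCloudWatchMetricsEnabled": True,
        },
    }

def max_cost_per_query(max_tb_per_query):
    """Upper bound on what a single query in the workgroup can cost."""
    return max_tb_per_query * ATHENA_USD_PER_TB_SCANNED

request = workgroup_request("adhoc-analysts", max_tb_per_query=1)
print(request["Configuration"]["BytesScannedCutoffPerQuery"], max_cost_per_query(1))
# With AWS credentials configured, this could be applied with:
#   boto3.client("athena").create_work_group(**request)
```

Putting ad-hoc users and scheduled DAG queries in separate workgroups lets each get its own limit and its own cost allocation tag.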

Security

  • Encryption at rest: S3 SSE-KMS / DynamoDB encryption / RDS encryption
  • VPC Endpoints: Route S3 / DynamoDB / Athena via private network, never public internet
  • IAM Role least privilege
  • Lake Formation column-level permissions: mask phone numbers / national IDs
  • GuardDuty + Security Hub: threat detection

Disaster Recovery

  • S3 Cross-Region Replication (for active-active requirements)
  • DynamoDB Global Tables (multi-Region real-time sync)
  • Aurora Global Database (cross-Region DR)

Model Governance

  • Model versioning (SageMaker Model Registry)
  • Canary deployments (Production Variants with weighted traffic splitting)
  • Model Monitor for drift detection
  • Scheduled retraining (weekly / daily) + automated evaluation
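
The canary item above relies on SageMaker routing traffic in proportion to each production variant's weight. The sketch below builds a `ProductionVariants` list for a 10% canary and computes the resulting split. The model names and instance settings are hypothetical, and the list is only constructed here, not sent to `create_endpoint_config`.

```python
def canary_variants(stable_model, canary_model, canary_share=0.10,
                    instance_type="ml.c5.2xlarge", instance_count=1):
    """Build a ProductionVariants list that sends `canary_share` of traffic to the canary."""
    def variant(name, model, weight):
        return {
            "VariantName": name,
            "ModelName": model,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
            "InitialVariantWeight": weight,
        }
    return [
        variant("stable", stable_model, 1.0 - canary_share),
        variant("canary", canary_model, canary_share),
    ]

def traffic_share(variants):
    """Traffic fraction per variant: its weight divided by the sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}

variants = canary_variants("rank-v41", "rank-v42")  # hypothetical model names
shares = traffic_share(variants)
print(shares)
```

If the canary's metrics hold up, the weights can be shifted step by step until the new model serves all traffic.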

AWS Data/AI Stack Updates in 2026 (Quick Reference)

A comparison of “now vs one year ago” for this architecture, helping customers decide which new capabilities to adopt:

| Module | Early 2025 | May 2026 Status |
|---|---|---|
| Iceberg Managed | S3 Tables just GA | S3 Tables + auto compaction, snapshot expiration, replication, IT tiering |
| Zero-ETL | Aurora to Lakehouse just GA | + RDS MySQL / DynamoDB / Salesforce / SAP / ServiceNow / Zendesk to Lakehouse (many new integrations) |
| Glue | 4.0 (Spark 3.3) | Glue 5.0 (Spark 3.5 + Iceberg auto maintenance) |
| Athena | Engine v3 | + Federated Spark / Iceberg materialized views (some regions in preview) |
| OpenSearch k-NN | nmslib / Lucene default | Faiss recommended for production, nmslib deprecated |
| Vector Storage | OpenSearch k-NN | + S3 Vectors GA (ideal for RAG / cold vectors) |
| SageMaker Branding | 4 pillars just split | Unified Studio GA, Bedrock IDE integration, Q Developer embedded |
| GenAI | Bedrock foundation models | + Amazon Nova series (Pro/Premier/Canvas/Reel), Bedrock AgentCore, Knowledge Bases + S3 Vectors / Structured Data Retrieval |
| HyperPod | Already GA | + Task governance / flexible training plans / Inference Components |
| Q in QuickSight | Topics + Q&A | + Scenarios (what-if natural language), auto Topic generation |

Specific impact on the customer’s social app recommendation scenario:

  • S3 Tables auto compaction — eliminates one maintenance job
  • Glue 5.0 + Iceberg — DWD processing performance improvement
  • Faiss engine — lower OpenSearch k-NN recall latency
  • Bedrock + Knowledge Bases — ready for future “AI content moderation / comment summarization / intelligent customer service” without self-training an LLM

Open Questions for the Customer

Questions to resolve before completing the POC (items 1-3 are already confirmed; items 4-8 remain open):

  1. DocumentDB version: 5.0 (confirmed) — Change Streams available
  2. Latency requirement: T+1 (confirmed) — Phase 1-2 does not need Flink
  3. ES ingestion: Required (confirmed) — Phase 1 runs OSI
  4. What does ES store? Does it overlap with MySQL?
  5. Recommendation scenarios: Follow recommendations / Feed / PYMK — which is the priority?
  6. Existing event tracking SDK / A/B platform / data team size?
  7. Region selection: us-east-1 / ap-northeast-1 / other?
  8. Compliance requirements: GDPR / China PIPL?

These questions determine specific tradeoffs and should be resolved during the Discovery phase.


Series Conclusion

Congratulations on making it here. You have now covered:

  • Big data fundamentals (OLTP / OLAP / data lake / Lakehouse / CDC / layered architecture)
  • S3 + Parquet + Iceberg storage foundation
  • DMS / Zero-ETL / OSI / Firehose / MSK data ingestion
  • Glue Catalog / Athena / Lake Formation metadata + querying
  • Glue / EMR / Flink / Lambda / MWAA compute and orchestration
  • Recommendation system funnel / two-tower / feature engineering / PIT
  • DynamoDB / Redis / OpenSearch k-NN / Neptune online serving
  • SageMaker full lifecycle
  • End-to-end architecture + cost estimates

Further Resources

| Topic | Link |
|---|---|
| AWS Modern Data Architecture | https://docs.aws.amazon.com/whitepapers/latest/modern-data-architecture-rays-on-aws/modern-data-architecture-rays-on-aws.html |
| Apache Iceberg on AWS | https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html |
| Building Recommendation Systems on AWS | https://aws.amazon.com/solutions/implementations/personalized-recommendations/ |
| AWS Big Data Blog | https://aws.amazon.com/blogs/big-data/ |
| Iceberg Official | https://iceberg.apache.org/ |
| Feast (Open-Source Feature Store) | https://feast.dev/ |
| DGL (Graph Neural Network Library) | https://www.dgl.ai/ |