How much does a production big data + recommendation system cost on AWS?

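For a social app with 1M DAU and 100M daily events, expect roughly $8,400/month at list price for the Phase 1 data foundation, rising to about $17,400/month for the full build-out covering Aurora, MSK, S3, Glue, EMR Serverless, DynamoDB, ElastiCache, OpenSearch, SageMaker endpoints, and MWAA. Reserved capacity, Savings Plans, and negotiated discounts typically lower these figures by 20-40%.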

What are the key cost optimization strategies for AWS big data?

Use Iceberg partition pruning and columnar formats to reduce Athena scan costs. Use EMR Serverless and Glue auto-scaling to avoid idle clusters. Choose S3 Intelligent-Tiering for cold data. Use reserved capacity for steady-state DynamoDB and ElastiCache workloads.

Big Data on AWS Deep Dive (Part 10): Full Architecture Blueprint and Cost Breakdown

This chapter assembles everything from the previous 9 chapters into a single picture. For the customer’s social app scenario, we present:

A complete end-to-end architecture diagram (one view of the entire system)
Monthly cost estimates for each component
Phased implementation recommendations
Key operational and governance considerations

Full End-to-End Architecture

Figure: Full Architecture (diagram)

Organized by layer from top to bottom:

Layer	Components
Data Sources	Aurora MySQL / DocumentDB / OpenSearch / Client SDK / User Requests
Ingestion	Aurora Zero-ETL / DMS / OSI / API GW + Lambda + MSK + Firehose
Data Lake	S3 + Iceberg ODS Layer
Layered Processing	Glue (DWD) + Athena (DWS) + EMR Serverless (ADS), orchestrated by MWAA
ML Training	SageMaker Training + Processing + Feature Store (optional)
Online Serving	DynamoDB + ElastiCache + OpenSearch k-NN + Neptune (optional) + SM Endpoint
Real-Time Pipeline	Managed Flink + Lambda
Metadata / Governance	Glue Data Catalog + Lake Formation + CloudWatch + Schema Registry

Data Flow Overview (Chronological)

Offline Batch Processing (Nightly)

02:00  Aurora Zero-ETL (continuous — data already in lake by early morning)
02:00  DMS DocumentDB CDC (continuous)
02:00  Firehose (continuous)
02:30  Glue Spark: dwd_user_action / dwd_post / dwd_user_relation
03:00  Athena CTAS: dws_user_daily / dws_post_daily / dws_pair_interaction
03:30  EMR Serverless: ads_user_features / ads_post_features
       EMR Serverless: ads_sample_follow / ads_sample_ctr
       EMR Serverless: ads_recall_u2u_cf
04:30  Glue Job: sync_user_features → DynamoDB
       Glue Job: sync_recall_pool → DynamoDB
05:00  SageMaker Training: train_recall_two_tower
       SageMaker Training: train_rank_lgb / deepfm
05:30  SageMaker Processing: compute_item_embeddings → OpenSearch
06:00  Lambda: deploy SageMaker Endpoints (Canary 10%)
06:30  Slack notification: DAG complete / DQ report
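
The nightly schedule above is effectively a dependency graph. A minimal sketch of how it could be written down and checked for a valid run order, using Python's standard-library graphlib rather than the MWAA/Airflow API. The task names are shortened from the schedule, and the edges are an assumption inferred from the timestamps, not the actual DAG definition.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits for.
# Edges are assumed from the timeline; the real MWAA DAG may differ.
NIGHTLY_DEPENDENCIES = {
    "glue_dwd": set(),                                     # 02:30, reads the ODS layer
    "athena_dws": {"glue_dwd"},                            # 03:00
    "emr_ads": {"athena_dws"},                             # 03:30
    "glue_sync_dynamodb": {"emr_ads"},                     # 04:30
    "sagemaker_training": {"emr_ads"},                     # 05:00
    "sagemaker_item_embeddings": {"sagemaker_training"},   # 05:30
    "deploy_endpoints_canary": {"sagemaker_training"},     # 06:00
    "notify_slack": {                                      # 06:30
        "glue_sync_dynamodb",
        "sagemaker_item_embeddings",
        "deploy_endpoints_canary",
    },
}


def run_order(dependencies):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(NIGHTLY_DEPENDENCIES)
print(" -> ".join(order))
```

Several orders can be valid because the DynamoDB sync and model training only share an upstream dependency, which is also why an orchestrator may run them in parallel.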

Real-Time Stream (Continuous)

User click → SDK → API GW → Lambda → MSK
                                   ├─→ Firehose (60s buffer) → S3
                                   ├─→ Flink (sub-second) → DynamoDB user_realtime
                                   └─→ Lambda fraud/risk detection

Online Inference Flow (Every Feed Refresh)

User request GET /feed (200ms budget)
  │
  ├─[5ms]───── DynamoDB user_features
  ├─[5ms]───── DynamoDB user_realtime
  ├─[1ms]───── Redis last_recommend_cache
  ├─[10ms]──── User Tower Endpoint → user embedding
  ├─[20ms]──── OpenSearch k-NN → 1000 candidates
  ├─[5ms]───── DynamoDB recall_u2u_cf → 200 candidates
  ├─[10ms]──── batch fetch 1200 item features (DynamoDB)
  ├─[30ms]──── Rank Endpoint scoring
  ├─[5ms]───── re-ranking (diversity)
  └─ return Top 10
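
To see how much of the 200ms budget the steps above consume, the per-step figures can simply be added up. The sketch below does that, and also estimates a critical path under the assumption that independent lookups run concurrently. The stage grouping is hypothetical and not taken from the flow itself.

```python
# Per-step latencies in milliseconds, copied from the flow above.
STEPS_MS = {
    "dynamodb_user_features": 5,
    "dynamodb_user_realtime": 5,
    "redis_last_recommend_cache": 1,
    "user_tower_endpoint": 10,
    "opensearch_knn": 20,
    "dynamodb_recall_u2u_cf": 5,
    "batch_fetch_item_features": 10,
    "rank_endpoint": 30,
    "rerank_diversity": 5,
}
BUDGET_MS = 200

# Worst case: every step waits for the previous one.
sequential_ms = sum(STEPS_MS.values())

# Assumed grouping: steps inside one stage run concurrently,
# so a stage costs as much as its slowest step.
STAGES = [
    ["dynamodb_user_features", "dynamodb_user_realtime", "redis_last_recommend_cache"],
    ["user_tower_endpoint"],
    ["opensearch_knn", "dynamodb_recall_u2u_cf"],
    ["batch_fetch_item_features"],
    ["rank_endpoint"],
    ["rerank_diversity"],
]
parallel_ms = sum(max(STEPS_MS[step] for step in stage) for stage in STAGES)

print(f"sequential: {sequential_ms}ms, headroom: {BUDGET_MS - sequential_ms}ms")
print(f"with concurrent lookups: {parallel_ms}ms")
```

Even the fully sequential path uses under half the budget, leaving room for network hops, serialization, and tail latency that the per-step figures do not include.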

Monthly Cost Estimates

Warning: Estimates are based on us-east-1 list prices. Assumptions: 1M DAU, 100M daily events, 30 TB active data. Actual pricing depends on current AWS rates.

Data Storage + Ingestion

Service	Usage	Monthly Cost
S3 Standard + IT	30 TB active + 100 TB cold archive	~$700 + $400 = $1,100
Aurora MySQL	r6g.xlarge x 2 + storage	~$700
Aurora Zero-ETL to Lakehouse	Per change volume	~$200
DocumentDB	r6g.large x 3	~$1,500
DMS	dms.t3.medium x 1	~$50
OpenSearch (business search)	r6g.large x 3	~$700
OpenSearch Ingestion	1 OCU	~$170

Event Tracking Pipeline

Service	Usage	Monthly Cost
API Gateway HTTP API	100M requests/day = 3B/month	~$3,000
Lambda (enrichment)	3B invocations, 128MB x 50ms	~$300
MSK (Provisioned)	m7g.large x 3 + 5 TB storage	~$500
Firehose	3B records + Parquet conversion	~$300
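
The API Gateway row is the largest single line item, so it is worth seeing how it is derived. A rough sketch, assuming the us-east-1 HTTP API list price of $1.00 per million requests for the first 300 million per month and $0.90 per million beyond that. Treat both rates as assumptions to check against the current pricing page.

```python
# (requests in tier, USD per million requests); assumed us-east-1 list prices.
HTTP_API_TIERS = [
    (300_000_000, 1.00),    # first 300M requests per month
    (float("inf"), 0.90),   # everything above 300M
]


def http_api_monthly_cost(requests_per_month):
    """Apply tiered per-request pricing to a monthly request count."""
    remaining = requests_per_month
    cost = 0.0
    for tier_size, usd_per_million in HTTP_API_TIERS:
        in_tier = min(remaining, tier_size)
        cost += in_tier / 1_000_000 * usd_per_million
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost


events_per_day = 100_000_000
monthly_requests = events_per_day * 30          # one request per event

unbatched_cost = http_api_monthly_cost(monthly_requests)
batched_cost = http_api_monthly_cost(monthly_requests // 10)  # 10 events per request

print(f"1 event/request:   ${unbatched_cost:,.0f}")
print(f"10 events/request: ${batched_cost:,.0f}")
```

The batched figure assumes each batch stays within a single metered request. HTTP API requests are metered in 512 KB increments, so very large batches would count as more than one.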

Layered Processing + Orchestration

Service	Usage	Monthly Cost
Glue ETL	50 DPU-hours/day	~$650
Athena	100 TB scanned/month (incl. ad-hoc)	~$500
EMR Serverless	100 vCPU-hours/day for ADS processing	~$300
MWAA	mw1.small	~$400
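
The Glue and Athena rows follow directly from per-unit pricing. A small sketch of that arithmetic, assuming us-east-1 list prices of $0.44 per Glue DPU-hour and $5 per TB scanned by Athena (both should be verified against current pricing). The `scan_fraction` parameter is hypothetical and stands in for the effect of partition pruning and columnar formats.

```python
GLUE_USD_PER_DPU_HOUR = 0.44   # assumed list price
ATHENA_USD_PER_TB = 5.00       # assumed list price
DAYS_PER_MONTH = 30


def glue_monthly_cost(dpu_hours_per_day):
    """Monthly Glue ETL cost for a steady daily DPU-hour usage."""
    return dpu_hours_per_day * DAYS_PER_MONTH * GLUE_USD_PER_DPU_HOUR


def athena_monthly_cost(tb_addressed, scan_fraction=1.0):
    """Monthly Athena cost; scan_fraction < 1 models data skipped by pruning."""
    return tb_addressed * scan_fraction * ATHENA_USD_PER_TB


glue_cost = glue_monthly_cost(50)
athena_cost = athena_monthly_cost(100)
athena_pruned_cost = athena_monthly_cost(100, scan_fraction=0.25)

print(f"Glue:   ${glue_cost:,.0f}")
print(f"Athena: ${athena_cost:,.0f} (full scan), ${athena_pruned_cost:,.0f} (25% scanned)")
```

Because Athena bills on bytes scanned, any reduction in scanned data translates one-to-one into a lower bill, which is why pruning and columnar formats come first among the optimization levers.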

ML Training

Service	Usage	Monthly Cost
SageMaker Training	g5.xlarge x 8h x 1/day x 30 days (two-tower + LightGBM ranking)	~$340
SageMaker Processing	m5.2xlarge x 2h x 1/day x 30 days (item embedding pre-computation)	~$25
SageMaker Endpoint (User Tower)	ml.c5.xlarge x 2, 24x7	~$200
SageMaker Endpoint (Rank)	ml.c5.2xlarge x 4, 24x7	~$800

Online Serving Layer (Recommendation Inference Read Path)

Service	Usage	Monthly Cost
DynamoDB	50K WCU + 200K RCU + 500GB	~$1,500
ElastiCache Redis	r7g.large x 2	~$300
OpenSearch k-NN	r6g.xlarge x 3 + 500GB	~$1,200
Neptune (Phase 3)	r6g.large x 2	~$700

Real-Time Pipeline

Service	Usage	Monthly Cost
Managed Flink	4 KPU	~$330

Governance / Monitoring

Service	Monthly Cost
Glue Catalog	~$10
Lake Formation	$0 (no additional charge)
CloudWatch	~$200
Data transfer	~$200

Summary (Split by Phase, Matching Customer T+1 Requirements vs Full Rollout)

Module	Phase 1 (T+1 Data Foundation)	Phase 2 (+ML v1)	Phase 3+ (Real-Time + Full)
Data Storage + Ingestion	~$3,200	~$4,200	~$5,400
Event Tracking Pipeline	~$3,400 (no MSK, Firehose only)	~$3,400	~$4,100 (+MSK)
Layered Processing + Orchestration	~$1,500	~$1,850	~$1,850
ML Training	$0	~$700	~$1,400
Online Serving	$0	~$2,000 (DDB+Redis+OS kNN)	~$3,700 (+Neptune optional)
Real-Time Pipeline (Flink)	$0	$0	~$330
Cross-AZ Traffic + Governance + Misc	~$300	~$400	~$600
Total (list price)	~$8,400/mo	~$12,550/mo	~$17,400/mo (incl. optional Neptune ~$700)

Warning: MSK cross-AZ replication traffic, CloudWatch log ingestion, and NAT Gateway traffic can add up to hundreds of dollars per month combined — these are commonly overlooked in estimates.

These are list price estimates. With EDP/PPA discounts, Reserved Instances, and Savings Plans, actual costs typically drop 20-40%.
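
The phase totals and the discount range can be recomputed from the summary table. The sketch below uses the table's own figures and applies the 20% and 40% bounds uniformly across all modules, which is a simplification since real discounts vary by service and commitment type.

```python
# (Phase 1, Phase 2, Phase 3+) monthly list-price estimates in USD, from the summary table.
PHASE_COSTS = {
    "Data Storage + Ingestion": (3200, 4200, 5400),
    "Event Tracking Pipeline": (3400, 3400, 4100),
    "Layered Processing + Orchestration": (1500, 1850, 1850),
    "ML Training": (0, 700, 1400),
    "Online Serving": (0, 2000, 3700),
    "Real-Time Pipeline (Flink)": (0, 0, 330),
    "Cross-AZ Traffic + Governance + Misc": (300, 400, 600),
}
PHASES = ("Phase 1", "Phase 2", "Phase 3+")

# Sum each phase column across all modules.
totals = dict(zip(PHASES, (sum(column) for column in zip(*PHASE_COSTS.values()))))


def discounted_range(list_price, low=0.20, high=0.40):
    """Return (best case, worst case) after a discount between low and high."""
    return list_price * (1 - high), list_price * (1 - low)


for phase, total in totals.items():
    best, worst = discounted_range(total)
    print(f"{phase}: list ${total:,} -> ${best:,.0f} to ${worst:,.0f} after discounts")
```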

Cost Distribution (Mental Model for Phase 3 Full List Price)

Ingestion + Storage     ████████  31%
Event Tracking          ███████   24%
Online Serving          ██████    21%
Compute / Processing    ███       11%
ML Training             ██         8%
Other (incl. cross-AZ)  ██         5%
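
The percentages in this breakdown can be derived from the Phase 3 column of the summary table. A short sketch, assuming that "Other" groups the Flink pipeline with the cross-AZ / governance / misc row; that grouping is inferred, not stated in the text.

```python
# Phase 3 monthly list-price estimates in USD, from the summary table.
PHASE3_USD = {
    "Ingestion + Storage": 5400,
    "Event Tracking": 4100,
    "Online Serving": 3700,
    "Compute / Processing": 1850,
    "ML Training": 1400,
    "Other (incl. cross-AZ)": 330 + 600,  # assumed grouping: Flink + cross-AZ/governance/misc
}

total_usd = sum(PHASE3_USD.values())
shares = {name: round(100 * usd / total_usd) for name, usd in PHASE3_USD.items()}

for name, pct in shares.items():
    print(f"{name:<24}{pct:>3}%")
```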

The three areas most worth optimizing:

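API Gateway call volume: have the SDK batch-merge event uploads. Request charges scale with the number of calls, so sending two or more events per request can reduce this line item by 50%+
DynamoDB: switch from On-Demand to Provisioned + Auto Scaling for 40%+ savings on steady traffic; eventually consistent reads consume half the read capacity of strongly consistent reads, which halves read cost wherever slightly stale data is acceptable
Cross-AZ traffic / NAT Gateway: use VPC Endpoints for S3 / DDB / Athena to keep traffic on the private network and off the NAT Gateway; often saves hundreds to thousands of dollars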
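
The first item, batch-merging uploads in the SDK, can be sketched as a small client-side buffer. `EventBatcher` and its parameters are hypothetical names for illustration, not part of any existing SDK, and the thresholds are placeholders.

```python
import time
from typing import Callable, Dict, List, Optional


class EventBatcher:
    """Buffer events and hand them to `send` as one batch (hypothetical SDK helper).

    A batch is flushed when it reaches `max_events`, or when a new event arrives
    and the oldest buffered event is older than `max_age_seconds`. There is no
    background timer, so callers should also call flush() on app pause or exit.
    """

    def __init__(
        self,
        send: Callable[[List[Dict]], None],
        max_events: int = 20,
        max_age_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[Dict] = []
        self._first_event_at: Optional[float] = None

    def track(self, event: Dict) -> None:
        if not self._buffer:
            self._first_event_at = self._clock()
        self._buffer.append(event)
        too_many = len(self._buffer) >= self._max_events
        too_old = self._clock() - self._first_event_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


# Usage: 25 events become 3 upload requests instead of 25.
sent_batches: List[List[Dict]] = []
batcher = EventBatcher(send=sent_batches.append, max_events=10)
for i in range(25):
    batcher.track({"event_id": i, "type": "click"})
batcher.flush()
print([len(batch) for batch in sent_batches])
```

The trade-off is delivery delay: a larger `max_events` or `max_age_seconds` lowers the request count but postpones when events reach the pipeline, which matters more once real-time features are introduced.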

Phased Implementation Recommendations

Phase 1: Data Foundation (1-1.5 months)

Goal: Get event tracking, business databases, and the ODS layer running end-to-end; Athena can query all data sources.

Aurora Zero-ETL (path 1) / DMS (path 3) dual-link POC
DocumentDB Change Streams + DMS enabled
OpenSearch Ingestion pipeline: ES to S3
API GW + Lambda + Firehose event tracking pipeline (the cost summary adds MSK in Phase 3, alongside Flink)
Glue Catalog registers all ODS tables
DWD v1 (Glue Job: cleansing + IP-to-geo)
MWAA running baseline DAG

Deliverable: T+1 data fully landed in the lake; Athena can query all ODS/DWD tables.

Phase 2: First Model Version (1-2 months)

Goal: The end-to-end recommendation pipeline is functional, and Top 10 recommendations have a measurable baseline.

DWS / ADS layer SQL orchestration
Follow-sample table + CTR-sample table (point-in-time (PIT) correct: each sample is joined only to feature values that existed at its event time)
User feature / content feature wide tables
LightGBM ranking model + two-tower recall model training
OpenSearch k-NN item vector index
DynamoDB sync for user features + recall pool
SageMaker Endpoint deployment: User Tower + Rank Model
Recommendation service (ECS) end-to-end integration
A/B testing platform integration
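
The point-in-time requirement on the sample tables is the item in this list that is easiest to get wrong, so a small illustration may help. The sketch below joins each label event to the latest feature snapshot at or before the event time. It is plain Python with made-up column names and integer timestamps, not the Spark/SQL job used in the pipeline.

```python
from bisect import bisect_right


def features_as_of(snapshots, event_ts):
    """Return the latest feature dict with snapshot_ts <= event_ts, else None.

    `snapshots` is a list of (snapshot_ts, features) sorted by snapshot_ts.
    """
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, event_ts)
    return snapshots[idx - 1][1] if idx > 0 else None


def build_samples(label_events, snapshots_by_user):
    """Join labels to features without looking into the future."""
    samples = []
    for user_id, event_ts, label in label_events:
        features = features_as_of(snapshots_by_user.get(user_id, []), event_ts)
        if features is not None:  # drop events with no prior snapshot
            samples.append({"user_id": user_id, "ts": event_ts, "label": label, **features})
    return samples


# Hypothetical data: timestamps are day numbers, follows_7d is an example feature.
snapshots_by_user = {"u1": [(1, {"follows_7d": 3}), (3, {"follows_7d": 8})]}
label_events = [("u1", 0, 0), ("u1", 2, 1), ("u1", 3, 0)]

samples = build_samples(label_events, snapshots_by_user)
for sample in samples:
    print(sample)
```

The event on day 2 receives the day 1 snapshot rather than the day 3 one. Joining to the most recent snapshot regardless of time would leak future information into training and inflate offline metrics.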

Deliverable: Core metrics (CTR / follow conversion rate / retention) have a baseline.

Phase 3: Advanced Capabilities (2-3 months)

Goal: Multi-channel recall, real-time features, model upgrades.

Managed Flink real-time feature pipeline
DynamoDB user_realtime goes live
Ranking model upgrade to DeepFM / DIN
Multi-channel recall (graph recall / interest tags / trending)
Introduce Neptune + Neptune ML
SageMaker Feature Store full migration (if cost-justified)

Phase 4: Continuous Optimization (Ongoing)

Cold-start optimization (new users / new content)
Multi-task models (MMoE / PLE)
Re-ranking (diversity / fairness)
Automated model monitoring (Model Monitor)
Cost optimization (RI / Savings Plans / caching)

Governance and Operations

Data Quality (DQ)

Run DQ checks after each DAG step:

Check Type	Example
Row count	dwd_user_action today > 100M
Uniqueness	event_id has no duplicates
Null rate	user_id null < 0.1%
Range	age between 0 and 150
Consistency	dws_user_daily total clicks = ods_event daily click count

Tools: AWS Glue Data Quality (based on Deequ) / custom SQL checks.
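
The "custom SQL checks" option can be as simple as a few predicates evaluated after each step. A minimal sketch of the first four check types over in-memory rows, in plain Python rather than Glue Data Quality rules. The sample rows and the lowered row-count threshold are made up for illustration.

```python
def row_count_at_least(rows, minimum):
    """Row count check: the table has at least `minimum` rows."""
    return len(rows) >= minimum


def is_unique(rows, column):
    """Uniqueness check: no value of `column` appears twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def null_rate_below(rows, column, max_rate):
    """Null rate check: share of None values is strictly below `max_rate`."""
    if not rows:
        return True
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows) < max_rate


def all_in_range(rows, column, low, high):
    """Range check: every non-null value lies in [low, high]."""
    return all(low <= row[column] <= high for row in rows if row.get(column) is not None)


sample_rows = [
    {"event_id": "e1", "user_id": "u1", "age": 27},
    {"event_id": "e2", "user_id": "u2", "age": 34},
    {"event_id": "e3", "user_id": None, "age": 151},
]

report = {
    "row_count": row_count_at_least(sample_rows, minimum=3),
    "uniqueness": is_unique(sample_rows, "event_id"),
    "null_rate": null_rate_below(sample_rows, "user_id", max_rate=0.001),
    "range": all_in_range(sample_rows, "age", 0, 150),
}
failed = [name for name, passed in report.items() if not passed]
print("failed checks:", failed)
```

In a DAG, a non-empty `failed` list would typically stop downstream tasks and feed the DQ report sent at the end of the run.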

Cost Management

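AWS Cost Anomaly Detection, or CloudWatch anomaly detection on billing metrics: billing anomaly alerts
Athena workgroup data usage controls (per-query scan limit): prevent runaway queries from blowing up costs
DynamoDB Auto Scaling: automatically scales provisioned capacity down when traffic drops
S3 Intelligent-Tiering: enable for all buckets (objects under 128 KB are not auto-tiered, so small-object prefixes see no benefit)
Cost Allocation Tags: tag every resource with team=algo / team=da / team=infra for per-team billing attribution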
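
For the Athena item, a per-query scan limit turns into a hard ceiling on what a single query can cost. The sketch below builds a workgroup configuration payload and computes that ceiling. The key names follow the `Configuration` argument of boto3's Athena `create_work_group`, but the call itself is omitted so the example runs without AWS credentials. The 200 GB cutoff and the $5 per TB price are assumptions.

```python
ATHENA_USD_PER_TB = 5.00   # assumed us-east-1 list price
BYTES_PER_TB = 10 ** 12    # decimal TB, used here only for a rough ceiling


def workgroup_configuration(cutoff_bytes):
    """Build a workgroup Configuration payload with a per-query scan limit."""
    return {
        "EnforceWorkGroupConfiguration": True,    # clients cannot override the settings
        "PublishCloudWatchMetricsEnabled": True,  # expose scan volume as metrics
        "BytesScannedCutoffPerQuery": cutoff_bytes,
    }


def max_cost_per_query(cutoff_bytes):
    """Upper bound on the scan charge for one query under the cutoff."""
    return cutoff_bytes / BYTES_PER_TB * ATHENA_USD_PER_TB


cutoff = 200 * 10 ** 9  # 200 GB, a hypothetical limit for ad-hoc users
config = workgroup_configuration(cutoff)
print(config)
print(f"worst-case cost per query: ${max_cost_per_query(cutoff):.2f}")
```

A separate workgroup with a higher or no cutoff is a common arrangement for scheduled CTAS jobs, so the limit only constrains ad-hoc exploration.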

Security

Encryption at rest: S3 SSE-KMS / DynamoDB encryption / RDS encryption
VPC Endpoints: Route S3 / DynamoDB / Athena via private network, never public internet
IAM Role least privilege
Lake Formation column-level permissions: mask phone numbers / national IDs
GuardDuty + Security Hub: threat detection

Disaster Recovery

S3 Cross-Region Replication (for active-active requirements)
DynamoDB Global Tables (multi-Region real-time sync)
Aurora Global Database (cross-Region DR)

Model Governance

Model versioning (SageMaker Model Registry)
Canary deployments (Production Variants with weighted traffic splitting)
Model Monitor for drift detection
Scheduled retraining (weekly / daily) + automated evaluation
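
Canary deployment with weighted production variants comes down to choosing variant weights. A small sketch that builds the weight entries for a 10% canary, matching the nightly "Canary 10%" step, and derives the resulting traffic share. The entry shape follows the `DesiredWeightsAndCapacities` argument of boto3's SageMaker `update_endpoint_weights_and_capacities`; the variant names are hypothetical and no AWS call is made.

```python
def canary_weights(stable_variant, canary_variant, canary_fraction):
    """Return variant weight entries that send `canary_fraction` of traffic to the canary."""
    if not 0 < canary_fraction < 1:
        raise ValueError("canary_fraction must be between 0 and 1 (exclusive)")
    return [
        {"VariantName": stable_variant, "DesiredWeight": round(1 - canary_fraction, 4)},
        {"VariantName": canary_variant, "DesiredWeight": round(canary_fraction, 4)},
    ]


def traffic_share(weights):
    """Traffic is split in proportion to each variant's weight over the total."""
    total = sum(entry["DesiredWeight"] for entry in weights)
    return {entry["VariantName"]: entry["DesiredWeight"] / total for entry in weights}


weights = canary_weights("rank-v41", "rank-v42", canary_fraction=0.10)
shares = traffic_share(weights)
print(weights)
print(shares)
```

Promotion then amounts to raising the canary weight in steps while Model Monitor and the A/B metrics stay healthy, and rollback is setting the canary weight back to zero.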

AWS Data/AI Stack Updates in 2026 (Quick Reference)

A comparison of “now vs one year ago” for this architecture, helping customers decide which new capabilities to adopt:

Module	Early 2025	May 2026 Status
Iceberg Managed	S3 Tables just GA	S3 Tables + auto compaction, snapshot expiration, replication, IT tiering
Zero-ETL	Aurora to Lakehouse just GA	+ RDS MySQL / DynamoDB / Salesforce / SAP / ServiceNow / Zendesk to Lakehouse (many new integrations)
Glue	4.0 (Spark 3.3)	Glue 5.0 (Spark 3.5 + Iceberg auto maintenance)
Athena	Engine v3	+ Federated Spark / Iceberg materialized views (some regions in preview)
OpenSearch k-NN	nmslib / Lucene default	Faiss recommended for production, nmslib deprecated
Vector Storage	OpenSearch k-NN	+ S3 Vectors GA (ideal for RAG / cold vectors)
SageMaker Branding	4 pillars just split	Unified Studio GA, Bedrock IDE integration, Q Developer embedded
GenAI	Bedrock foundation models	+ Amazon Nova series (Pro/Premier/Canvas/Reel), Bedrock AgentCore, Knowledge Bases + S3 Vectors / Structured Data Retrieval
HyperPod	Already GA	+ Task governance / flexible training plans / Inference Components
Q in QuickSight	Topics + Q&A	+ Scenarios (what-if natural language), auto Topic generation

Specific impact on the customer’s social app recommendation scenario:

S3 Tables auto compaction — eliminates one maintenance job
Glue 5.0 + Iceberg — DWD processing performance improvement
Faiss engine — lower OpenSearch k-NN recall latency
Bedrock + Knowledge Bases — ready for future “AI content moderation / comment summarization / intelligent customer service” without self-training an LLM
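
For the Faiss item, adopting the engine is mostly a matter of index configuration. A sketch of what the index body and the recall query could look like, built as plain Python dictionaries. The field name, the 64-dimension embedding size, and the HNSW parameters are assumptions for illustration and are not taken from the customer's setup.

```python
def knn_index_body(field, dimension, engine="faiss"):
    """Index settings and mapping for an HNSW vector field on the given engine."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": engine,
                        "space_type": "innerproduct",  # suits dot-product two-tower scores
                        "parameters": {"ef_construction": 128, "m": 16},
                    },
                }
            }
        },
    }


def knn_query_body(field, vector, k):
    """Approximate nearest-neighbour query returning the top k items."""
    return {"size": k, "query": {"knn": {field: {"vector": vector, "k": k}}}}


index_body = knn_index_body("item_embedding", dimension=64)
query_body = knn_query_body("item_embedding", vector=[0.0] * 64, k=1000)

print(index_body["mappings"]["properties"]["item_embedding"]["method"]["engine"])
print(query_body["size"])
```

The `k=1000` value mirrors the 1000-candidate recall step in the online inference flow. Moving an existing nmslib index to Faiss requires creating a new index and reindexing, since the engine is fixed in the mapping.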

Open Questions for the Customer

Questions to resolve before completing the POC:

DocumentDB version: 5.0 (confirmed) — Change Streams available
Latency requirement: T+1 (confirmed) — Phase 1-2 does not need Flink
ES ingestion: Required (confirmed) — Phase 1 runs OSI
What does ES store? Does it overlap with MySQL?
Recommendation scenarios: Follow recommendations / Feed / PYMK — which is the priority?
Existing event tracking SDK / A/B platform / data team size?
Region selection: us-east-1 / ap-northeast-1 / other?
Compliance requirements: GDPR / China PIPL?

These questions determine specific tradeoffs and should be resolved during the Discovery phase.

Series Conclusion

Congratulations on making it here. You have now covered:

Big data fundamentals (OLTP / OLAP / data lake / Lakehouse / CDC / layered architecture)
S3 + Parquet + Iceberg storage foundation
DMS / Zero-ETL / OSI / Firehose / MSK data ingestion
Glue Catalog / Athena / Lake Formation metadata + querying
Glue / EMR / Flink / Lambda / MWAA compute and orchestration
Recommendation system funnel / two-tower / feature engineering / PIT
DynamoDB / Redis / OpenSearch k-NN / Neptune online serving
SageMaker full lifecycle
End-to-end architecture + cost estimates

Further Resources

Topic	Link
AWS Modern Data Architecture	https://docs.aws.amazon.com/whitepapers/latest/modern-data-architecture-rays-on-aws/modern-data-architecture-rays-on-aws.html
Apache Iceberg on AWS	https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html
Building Recommendation Systems on AWS	https://aws.amazon.com/solutions/implementations/personalized-recommendations/
AWS Big Data Blog	https://aws.amazon.com/blogs/big-data/
Iceberg Official	https://iceberg.apache.org/
Feast (Open-Source Feature Store)	https://feast.dev/
DGL (Graph Neural Network Library)	https://www.dgl.ai/