Big Data on AWS Deep Dive

A 10-part series covering everything you need to build a production data warehouse and recommendation system on AWS — from OLTP vs. OLAP basics to a full end-to-end architecture with cost estimates.

zhuermu · 10 Chapters · ~130 min total

big-data · aws · data-lake · iceberg · recommendation-system · sagemaker

Who Is This For?

Backend engineers and data engineers who know SQL and have used a relational database, but haven't built a data lake or recommendation system before. No Hadoop experience required — we start from scratch and build up to a production architecture on AWS.

Part 1: Big Data Fundamentals

Data lake vs. warehouse vs. lakehouse, OLTP vs. OLAP, batch vs. stream

Part 2: Storage & File Formats

S3 object storage, Parquet columnar format, Apache Iceberg internals

Part 3: Data Ingestion

DMS CDC, Aurora Zero-ETL, Firehose micro-batching, MSK / Kafka

Part 4: Metadata & Query Engines

Glue Data Catalog, Athena serverless SQL, Lake Formation permissions

Part 5: Compute & Orchestration

EMR Serverless, Glue ETL, Managed Flink, MWAA, Step Functions

Part 6: End-to-End Data Pipeline

Client click → MSK → S3 → warehouse layers (ODS → DWD → DWS → ADS) → DynamoDB

Part 7: Recommendation Fundamentals

Funnel architecture, two-tower recall, point-in-time (PIT) training correctness

Part 8: Online Feature Stores

DynamoDB, ElastiCache, OpenSearch k-NN, Neptune graph recall

Part 9: SageMaker & ML Platform

Studio, Feature Store, Training Jobs, Endpoints, Model Monitor

Part 10: Full Architecture & Costs

Complete blueprint, every AWS service mapped, monthly cost breakdown

Learning Paths

Data Engineer Track: Chapters 1-6 cover data lake foundations, ingestion, processing, and pipeline orchestration.

ML Engineer Track: Chapters 7-9 cover recommendation systems, feature stores, and the SageMaker ML platform.

Full Stack / Architect: All 10 chapters — from fundamentals to the complete architecture blueprint with cost estimates.