Big Data on AWS Deep Dive (Part 5): EMR, Glue ETL, Flink, and Pipeline Orchestration
Compare EMR Serverless, Glue ETL, Managed Flink, and choose the right compute engine. Then orchestrate data pipelines with MWAA (Airflow) and Step Functions.
Once data lands in S3 and is registered in the Glue Catalog, the real work begins: layered ETL processing, stream processing, and scheduling.
Services covered in this chapter: EMR Serverless / AWS Glue ETL / Managed Flink / Lambda / MWAA / Step Functions.
Compute Engine Landscape
There are many options for “computing things” in the data lake. Here is a classification:
| Category | Service | Best For |
|---|---|---|
| SQL Engine (lightweight) | Athena CTAS / INSERT | Transformations expressible in simple SQL |
| Spark on Serverless | AWS Glue ETL / EMR Serverless | Medium-to-heavy batch processing |
| Spark on EC2 | EMR on EC2 | Highly customized / extreme Spot savings |
| Stream Processing | Amazon Managed Service for Apache Flink | Real-time features / real-time fraud detection |
| Lightweight Functions | AWS Lambda | Short tasks / small data / enrichment |
How to choose:
- A single SQL statement can do it all → Athena CTAS (cheapest)
- Need Python UDFs / complex transformations / large datasets → EMR Serverless first (typically cheaper than Glue, better performance)
- Team already familiar with Glue Studio visual designer / medium data volumes → Glue ETL Job
- Need Spot instances / custom Hadoop components → EMR on EC2
- Real-time → Managed Flink
AWS Glue ETL
The Two Roles of Glue
Note that Glue is an umbrella product containing several sub-services:
- Glue Data Catalog (covered in Part 4) — metadata
- Glue ETL — Spark job engine
- Glue Crawler — automatic table discovery
- Glue Studio — visual drag-and-drop ETL
- Glue DataBrew — data cleansing UI
- Glue Schema Registry — schema management
This section focuses on Glue ETL.
Glue ETL = Serverless Spark
At its core: managed Apache Spark + Python (PySpark). You write the Spark Job code, and AWS spins up a Spark cluster, runs it, and tears it down.
# A typical Glue Job (PySpark)
from pyspark.sql import SparkSession
from awsglue.context import GlueContext
glueContext = GlueContext(SparkSession.builder.getOrCreate())
# Read from an Iceberg table
df = glueContext.create_data_frame.from_catalog(
database="poc_social_layla",
table_name="ods_event"
).filter("dt = '2026-05-10'")
# Transform
df_clean = df.dropDuplicates(['event_id'])
# Write to DWD Iceberg table
df_clean.writeTo("poc_social_layla.dwd_user_action").append()
Pricing: DPU x Time
DPU = Data Processing Unit = 4 vCPU + 16 GB memory.
Pricing:
- Glue ETL: $0.44 per DPU-hour (1-minute minimum)
- Glue Flex (low priority, 35% cheaper): $0.29 per DPU-hour
- Streaming Job: $0.44 per DPU-hour
Real-world cost example: 30 GB of data per day, 10 DPU x 10 minutes = 1.67 DPU-hours x $0.44 ~ $0.73/day ~ $22/month.
Glue 5.0 (2024-2025 GA) Improvements
Glue 5.0 upgrades Spark to 3.5, natively integrates the latest versions of Iceberg / Delta / Hudi, and introduces:
- Automatic Iceberg compaction / snapshot expiration (S3 Tables / Glue Catalog managed) — no need to write your own OPTIMIZE jobs
- Startup time reduced from ~1 minute to ~30 seconds
- Lake Formation row-/column-level permissions enforced inside Spark
Remaining Pain Points of Glue
- Startup is still slower compared to open-source Spark (not cost-effective for short jobs)
- At equivalent scale, EMR Serverless is typically still 15-25% cheaper
The rule of thumb remains: heavy jobs → EMR Serverless / medium jobs → Glue 5.0 / lightweight SQL → Athena CTAS.
Official documentation:
- Glue ETL: https://docs.aws.amazon.com/glue/
- Writing Iceberg: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
EMR / EMR Serverless
EMR Version Evolution
| Version | Description |
|---|---|
| EMR on EC2 | Classic: spin up an EC2 cluster running Hadoop / Spark / Hive. Most flexible, cheapest (Spot), but requires ops |
| EMR on EKS | Run Spark on Kubernetes |
| EMR Serverless (2022+) | Fully serverless, AWS manages everything, billed per vCPU + GB-hour |
Our recommendation: EMR Serverless.
EMR Serverless Execution Model
Your Spark Job (PySpark / Scala JAR)
|
v
EMR Serverless Application
(Apache Spark / Hive — your choice)
|
v
AWS automatically spins up workers (elastic on demand)
|
v
Job completes, results land in S3, resources released
Key characteristics:
- Sub-second startup (with pre-initialized capacity mode)
- Billed by exact resource usage (vCPU-hour + Memory-GB-hour + storage)
- No minimum spend
EMR Serverless vs Glue ETL
| EMR Serverless | Glue ETL | |
|---|---|---|
| Price | Slightly lower | Slightly higher |
| Startup time | Seconds (pre-init) / tens of seconds | 1-2 minutes |
| Spark version | Closer to upstream open source | AWS fork, slightly behind |
| Ease of use | Medium | High (Glue Studio visual) |
| Recommendation | Top choice for heavy batch | Team already using it / simple jobs |
Official documentation: EMR Serverless: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/
Amazon Managed Service for Apache Flink
What Is Flink and Why Not Spark Streaming
Apache Flink is an open-source stream processing engine. Compared to Spark Streaming, it excels at true stream processing:
| Spark Streaming | Flink | |
|---|---|---|
| Model | Micro-batch (second-level) | True streaming (event-level, millisecond-level) |
| State management | Weak | Strong (RocksDB state backend, Savepoints) |
| Exactly-once | Complex | Native support |
| Time semantics | Average | Excellent (event time + watermarks) |
For complex real-time computations (e.g., “3 consecutive failed logins within the last 5 minutes for a user”), Flink provides a far better experience than Spark Streaming.
AWS Managed Flink Service
Old name: Kinesis Data Analytics for Apache Flink
New name (renamed Aug 2023): Amazon Managed Service for Apache Flink
Key features:
- Managed Flink cluster
- Billed per KPU (Kinesis Processing Unit = 1 vCPU + 4 GB)
- Integrates with MSK / KDS / Firehose / DynamoDB / S3
Role in Customer Scenarios: Real-Time Features
MSK Topic: events
|
v
Managed Flink:
- Group by user_id
- Maintain "last 5 clicks" state (RocksDB)
- On each new event → update state → write to DynamoDB
|
v
DynamoDB user_realtime_features:
user_id=12345 → { last_5_clicks: [item_a, item_b, ...] }
|
v
Recommendation service point-queries at inference time (milliseconds)
Practical tips:
- Size KPU count based on throughput (~10,000 events/sec per KPU)
- Enable RocksDB backend when state volume is large
- Set checkpoint interval to 1-5 minutes for fault recovery
Official documentation: https://docs.aws.amazon.com/managed-flink/
AWS Lambda
What Is Lambda
Serverless function compute. You provide code; events trigger execution:
- Maximum duration: 15 minutes per invocation
- Memory: 128 MB to 10 GB
- Billed per millisecond
Role in the Data Pipeline
| Position | Usage |
|---|---|
| Behind API Gateway | Authentication + event enrichment |
| MSK / KDS consumer | Simple real-time processing |
| S3 PUT trigger | Process files immediately on landing |
| EventBridge / Cron | Lightweight periodic tasks |
| Glue / EMR trigger | Kick off downstream jobs |
What Lambda Is Not Good At
- Long tasks (> 15 minutes) → Use ECS / Step Functions
- Large memory (> 10 GB) → Use EMR
- Persistent state → Use DynamoDB / RDS
Pipeline Orchestration: MWAA vs Step Functions
A data warehouse ETL is never just one job. It is dozens of jobs chained together by dependency into a DAG (Directed Acyclic Graph):
Who manages this DAG? Two contenders.
MWAA (Managed Workflows for Apache Airflow)
A managed version of the open-source Apache Airflow (originally created at Airbnb).
How Airflow works: you write Python to describe a DAG.
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.athena import AthenaOperator
from datetime import datetime
with DAG('daily_warehouse', start_date=datetime(2026,5,1), schedule='0 2 * * *') as dag:
dwd_clean = GlueJobOperator(
task_id='dwd_clean',
job_name='dwd_user_action_clean',
)
dws_aggr = AthenaOperator(
task_id='dws_aggr',
query="INSERT INTO dws_user_daily SELECT ... FROM dwd_user_action ...",
workgroup='poc-social-layla',
)
ads_features = GlueJobOperator(
task_id='ads_features',
job_name='ads_user_features',
)
dwd_clean >> dws_aggr >> ads_features
Strengths:
- Powerful DAG expressiveness (conditional branches, dynamic generation, SubDAGs)
- 200+ Operators (including Glue / EMR / Athena / SageMaker)
- Intuitive Web UI (view DAG status, retry, backfill)
- Retry, SLA, and alerting all built-in
Pain points:
- MWAA has a high floor: an mw1.small base capacity costs ~$300/month; with 1-2 elastic workers it lands at $300-400/month
- Airflow has a learning curve (DAG scheduling concepts, execution_date timezone pitfalls)
- Upgrading Airflow versions is painful
AWS Step Functions
Fully AWS-native, billed per state transition ($25 per million state transitions).
Workflows are described in JSON (ASL = Amazon States Language):
{
"StartAt": "DWD",
"States": {
"DWD": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {"JobName": "dwd_user_action_clean"},
"Next": "DWS"
},
"DWS": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:startQueryExecution.sync",
"Parameters": {"QueryString": "INSERT INTO ..."},
"Next": "ADS"
},
"ADS": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {"JobName": "ads_user_features"},
"End": true
}
}
}
Strengths:
- Fully serverless, pay-per-use
- Integrates with 200+ AWS services (call directly, no Lambda wrapper needed)
- Visual DAG (view each step’s status in real-time during execution)
- Supports error retry, parallel branches, Map State
Pain points:
- DAG expressiveness is less flexible than Airflow (dynamic DAGs, complex conditional branches are weaker)
- ASL JSON becomes hard to maintain when large (recommend generating with CDK / Terraform)
How to Choose
| Scenario | Pick |
|---|---|
| Warehouse batch ETL, 10+ tasks | MWAA (mature Airflow ecosystem) |
| Warehouse batch ETL, simple 5-20 tasks | Step Functions (cheap, serverless) |
| Cross-team complex scheduling, need Web UI | MWAA |
| Single business line, no ops desired | Step Functions |
| Hybrid: MWAA as master scheduler + Step Functions for sub-workflows | Both |
Official documentation:
- MWAA: https://docs.aws.amazon.com/mwaa/
- Step Functions: https://docs.aws.amazon.com/step-functions/
Customer Scenario: Orchestration Example
Daily 02:00 (UTC+8): MWAA DAG kicks off
+-- 02:00 ods_user_full_load_check (prerequisite: DMS / Zero-ETL completed for the day)
+-- 02:30 dwd_user_action_clean (Glue Job)
+-- 02:30 dwd_post_enrich (Glue Job)
+-- 03:00 dws_user_daily (Athena CTAS)
+-- 03:30 ads_user_features (EMR Serverless)
+-- 03:30 ads_sample_follow (EMR Serverless)
+-- 04:00 sync_to_dynamodb (Glue Job writes to DynamoDB)
+-- 04:30 train_recall_model (SageMaker Training Job)
+-- 04:30 train_rank_model (SageMaker Training Job)
+-- 05:30 deploy_endpoint (Lambda calls SageMaker API)
+-- 06:00 dq_check_report (Slack alert / email)
Each step:
- Auto-retry 2 times on failure
- Still failing → page on-call engineer
- Overall DAG SLA: 7 hours
Chapter Summary
| Service | One-Liner |
|---|---|
| Athena CTAS | Cheapest option for simple SQL transformations |
| Glue ETL | Managed Spark with Studio visual editor |
| EMR Serverless | Heavy batch processing — cheaper and faster |
| Managed Flink | Real-time stream processing |
| Lambda | Short tasks / enrichment / triggers |
| MWAA | Complex DAG scheduling with the Airflow ecosystem |
| Step Functions | Simple DAGs, serverless, AWS-native |
This concludes the compute and orchestration layer. In the next chapter, we stitch together the services from Parts 3-5 and draw the end-to-end data pipeline for our customer scenario.