When should I use EMR Serverless vs Glue ETL?

Use Glue ETL for standard Spark ETL jobs with auto-scaling and no cluster management. Use EMR Serverless when you need full Spark/Hive/Presto control, custom libraries, or workloads that benefit from EMR's optimizations.

Should I use MWAA (Airflow) or Step Functions for pipeline orchestration?

Use MWAA for complex DAGs with 10+ tasks, cross-team dependencies, and backfill needs. Use Step Functions for simpler workflows (5-20 steps) that benefit from fully serverless, pay-per-transition pricing.

Big Data on AWS Deep Dive (Part 5): EMR, Glue ETL, Flink, and Pipeline Orchestration

Once data lands in S3 and is registered in the Glue Catalog, the real work begins: layered ETL processing, stream processing, and scheduling.

Services covered in this chapter: EMR Serverless / AWS Glue ETL / Managed Flink / Lambda / MWAA / Step Functions.

Compute Engine Landscape

There are many options for “computing things” in the data lake. Here is a classification:

Category	Service	Best For
SQL Engine (lightweight)	Athena CTAS / INSERT	Transformations expressible in simple SQL
Spark on Serverless	AWS Glue ETL / EMR Serverless	Medium-to-heavy batch processing
Spark on EC2	EMR on EC2	Highly customized / extreme Spot savings
Stream Processing	Amazon Managed Service for Apache Flink	Real-time features / real-time fraud detection
Lightweight Functions	AWS Lambda	Short tasks / small data / enrichment

How to choose:

A single SQL statement can do it all → Athena CTAS (cheapest)
Need Python UDFs / complex transformations / large datasets → EMR Serverless first (typically cheaper than Glue, better performance)
Team already familiar with Glue Studio visual designer / medium data volumes → Glue ETL Job
Need Spot instances / custom Hadoop components → EMR on EC2
Real-time → Managed Flink

AWS Glue ETL

The Two Roles of Glue

Note that Glue is an umbrella product containing several sub-services:

Glue Data Catalog (covered in Part 4) — metadata
Glue ETL — Spark job engine
Glue Crawler — automatic table discovery
Glue Studio — visual drag-and-drop ETL
Glue DataBrew — data cleansing UI
Glue Schema Registry — schema management

This section focuses on Glue ETL.

Glue ETL = Serverless Spark

At its core: managed Apache Spark + Python (PySpark). You write the Spark Job code, and AWS spins up a Spark cluster, runs it, and tears it down.

# A typical Glue Job (PySpark)
from pyspark.sql import SparkSession
from awsglue.context import GlueContext

glueContext = GlueContext(SparkSession.builder.getOrCreate())

# Read from an Iceberg table
df = glueContext.create_data_frame.from_catalog(
    database="poc_social_layla",
    table_name="ods_event"
).filter("dt = '2026-05-10'")

# Transform
df_clean = df.dropDuplicates(['event_id'])

# Write to DWD Iceberg table
df_clean.writeTo("poc_social_layla.dwd_user_action").append()

Pricing: DPU x Time

DPU = Data Processing Unit = 4 vCPU + 16 GB memory.

Pricing:

Glue ETL: $0.44 per DPU-hour (1-minute minimum)
Glue Flex (low priority, 35% cheaper): $0.29 per DPU-hour
Streaming Job: $0.44 per DPU-hour

Real-world cost example: 30 GB of data per day, 10 DPU x 10 minutes = 1.67 DPU-hours x $0.44 ~ $0.73/day ~ $22/month.

Glue 5.0 (2024-2025 GA) Improvements

Glue 5.0 upgrades Spark to 3.5, natively integrates the latest versions of Iceberg / Delta / Hudi, and introduces:

Automatic Iceberg compaction / snapshot expiration (S3 Tables / Glue Catalog managed) — no need to write your own OPTIMIZE jobs
Startup time reduced from ~1 minute to ~30 seconds
Lake Formation row-/column-level permissions enforced inside Spark

Remaining Pain Points of Glue

Startup is still slower compared to open-source Spark (not cost-effective for short jobs)
At equivalent scale, EMR Serverless is typically still 15-25% cheaper

The rule of thumb remains: heavy jobs → EMR Serverless / medium jobs → Glue 5.0 / lightweight SQL → Athena CTAS.

Official documentation:

Glue ETL: https://docs.aws.amazon.com/glue/
Writing Iceberg: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html

EMR / EMR Serverless

EMR Version Evolution

Version	Description
EMR on EC2	Classic: spin up an EC2 cluster running Hadoop / Spark / Hive. Most flexible, cheapest (Spot), but requires ops
EMR on EKS	Run Spark on Kubernetes
EMR Serverless (2022+)	Fully serverless, AWS manages everything, billed per vCPU + GB-hour

Our recommendation: EMR Serverless.

EMR Serverless Execution Model

Your Spark Job (PySpark / Scala JAR)
    |
    v
 EMR Serverless Application
   (Apache Spark / Hive — your choice)
    |
    v
 AWS automatically spins up workers (elastic on demand)
    |
    v
 Job completes, results land in S3, resources released

Key characteristics:

Sub-second startup (with pre-initialized capacity mode)
Billed by exact resource usage (vCPU-hour + Memory-GB-hour + storage)
No minimum spend

EMR Serverless vs Glue ETL

	EMR Serverless	Glue ETL
Price	Slightly lower	Slightly higher
Startup time	Seconds (pre-init) / tens of seconds	1-2 minutes
Spark version	Closer to upstream open source	AWS fork, slightly behind
Ease of use	Medium	High (Glue Studio visual)
Recommendation	Top choice for heavy batch	Team already using it / simple jobs

Official documentation: EMR Serverless: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/

Amazon Managed Service for Apache Flink

What Is Flink and Why Not Spark Streaming

Apache Flink is an open-source stream processing engine. Compared to Spark Streaming, it excels at true stream processing:

	Spark Streaming	Flink
Model	Micro-batch (second-level)	True streaming (event-level, millisecond-level)
State management	Weak	Strong (RocksDB state backend, Savepoints)
Exactly-once	Complex	Native support
Time semantics	Average	Excellent (event time + watermarks)

For complex real-time computations (e.g., “3 consecutive failed logins within the last 5 minutes for a user”), Flink provides a far better experience than Spark Streaming.

AWS Managed Flink Service

Old name: Kinesis Data Analytics for Apache Flink
New name (renamed Aug 2023): Amazon Managed Service for Apache Flink

Key features:

Managed Flink cluster
Billed per KPU (Kinesis Processing Unit = 1 vCPU + 4 GB)
Integrates with MSK / KDS / Firehose / DynamoDB / S3

Role in Customer Scenarios: Real-Time Features

MSK Topic: events
    |
    v
Managed Flink:
  - Group by user_id
  - Maintain "last 5 clicks" state (RocksDB)
  - On each new event → update state → write to DynamoDB
    |
    v
DynamoDB user_realtime_features:
  user_id=12345 → { last_5_clicks: [item_a, item_b, ...] }
    |
    v
Recommendation service point-queries at inference time (milliseconds)

Practical tips:

Size KPU count based on throughput (~10,000 events/sec per KPU)
Enable RocksDB backend when state volume is large
Set checkpoint interval to 1-5 minutes for fault recovery

Official documentation: https://docs.aws.amazon.com/managed-flink/

AWS Lambda

What Is Lambda

Serverless function compute. You provide code; events trigger execution:

Maximum duration: 15 minutes per invocation
Memory: 128 MB to 10 GB
Billed per millisecond

Role in the Data Pipeline

Position	Usage
Behind API Gateway	Authentication + event enrichment
MSK / KDS consumer	Simple real-time processing
S3 PUT trigger	Process files immediately on landing
EventBridge / Cron	Lightweight periodic tasks
Glue / EMR trigger	Kick off downstream jobs

What Lambda Is Not Good At

Long tasks (> 15 minutes) → Use ECS / Step Functions
Large memory (> 10 GB) → Use EMR
Persistent state → Use DynamoDB / RDS

Pipeline Orchestration: MWAA vs Step Functions

A data warehouse ETL is never just one job. It is dozens of jobs chained together by dependency into a DAG (Directed Acyclic Graph):

Orchestration DAG

Who manages this DAG? Two contenders.

MWAA (Managed Workflows for Apache Airflow)

A managed version of the open-source Apache Airflow (originally created at Airbnb).

How Airflow works: you write Python to describe a DAG.

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.athena import AthenaOperator
from datetime import datetime

with DAG('daily_warehouse', start_date=datetime(2026,5,1), schedule='0 2 * * *') as dag:
    
    dwd_clean = GlueJobOperator(
        task_id='dwd_clean',
        job_name='dwd_user_action_clean',
    )
    
    dws_aggr = AthenaOperator(
        task_id='dws_aggr',
        query="INSERT INTO dws_user_daily SELECT ... FROM dwd_user_action ...",
        workgroup='poc-social-layla',
    )
    
    ads_features = GlueJobOperator(
        task_id='ads_features',
        job_name='ads_user_features',
    )
    
    dwd_clean >> dws_aggr >> ads_features

Strengths:

Powerful DAG expressiveness (conditional branches, dynamic generation, SubDAGs)
200+ Operators (including Glue / EMR / Athena / SageMaker)
Intuitive Web UI (view DAG status, retry, backfill)
Retry, SLA, and alerting all built-in

Pain points:

MWAA has a high floor: an mw1.small base capacity costs ~$300/month; with 1-2 elastic workers it lands at $300-400/month
Airflow has a learning curve (DAG scheduling concepts, execution_date timezone pitfalls)
Upgrading Airflow versions is painful

AWS Step Functions

Fully AWS-native, billed per state transition ($25 per million state transitions).

Workflows are described in JSON (ASL = Amazon States Language):

{
  "StartAt": "DWD",
  "States": {
    "DWD": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "dwd_user_action_clean"},
      "Next": "DWS"
    },
    "DWS": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {"QueryString": "INSERT INTO ..."},
      "Next": "ADS"
    },
    "ADS": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "ads_user_features"},
      "End": true
    }
  }
}

Strengths:

Fully serverless, pay-per-use
Integrates with 200+ AWS services (call directly, no Lambda wrapper needed)
Visual DAG (view each step’s status in real-time during execution)
Supports error retry, parallel branches, Map State

Pain points:

DAG expressiveness is less flexible than Airflow (dynamic DAGs, complex conditional branches are weaker)
ASL JSON becomes hard to maintain when large (recommend generating with CDK / Terraform)

How to Choose

Scenario	Pick
Warehouse batch ETL, 10+ tasks	MWAA (mature Airflow ecosystem)
Warehouse batch ETL, simple 5-20 tasks	Step Functions (cheap, serverless)
Cross-team complex scheduling, need Web UI	MWAA
Single business line, no ops desired	Step Functions
Hybrid: MWAA as master scheduler + Step Functions for sub-workflows	Both

Official documentation:

MWAA: https://docs.aws.amazon.com/mwaa/
Step Functions: https://docs.aws.amazon.com/step-functions/

Customer Scenario: Orchestration Example

Daily 02:00 (UTC+8): MWAA DAG kicks off
  +-- 02:00 ods_user_full_load_check (prerequisite: DMS / Zero-ETL completed for the day)
  +-- 02:30 dwd_user_action_clean (Glue Job)
  +-- 02:30 dwd_post_enrich       (Glue Job)
  +-- 03:00 dws_user_daily        (Athena CTAS)
  +-- 03:30 ads_user_features     (EMR Serverless)
  +-- 03:30 ads_sample_follow     (EMR Serverless)
  +-- 04:00 sync_to_dynamodb      (Glue Job writes to DynamoDB)
  +-- 04:30 train_recall_model    (SageMaker Training Job)
  +-- 04:30 train_rank_model      (SageMaker Training Job)
  +-- 05:30 deploy_endpoint       (Lambda calls SageMaker API)
  +-- 06:00 dq_check_report       (Slack alert / email)

Each step:
- Auto-retry 2 times on failure
- Still failing → page on-call engineer
- Overall DAG SLA: 7 hours

Chapter Summary

Service	One-Liner
Athena CTAS	Cheapest option for simple SQL transformations
Glue ETL	Managed Spark with Studio visual editor
EMR Serverless	Heavy batch processing — cheaper and faster
Managed Flink	Real-time stream processing
Lambda	Short tasks / enrichment / triggers
MWAA	Complex DAG scheduling with the Airflow ecosystem
Step Functions	Simple DAGs, serverless, AWS-native

This concludes the compute and orchestration layer. In the next chapter, we stitch together the services from Parts 3-5 and draw the end-to-end data pipeline for our customer scenario.