Big Data on AWS Deep Dive (Part 5): EMR, Glue ETL, Flink, and Pipeline Orchestration

Compare EMR Serverless, Glue ETL, Managed Flink, and choose the right compute engine. Then orchestrate data pipelines with MWAA (Airflow) and Step Functions.

zhuermu · · 14 min
big-dataawsemrglue-etlflinkmwaaairflowstep-functionsdata-pipeline

Once data lands in S3 and is registered in the Glue Catalog, the real work begins: layered ETL processing, stream processing, and scheduling.

Services covered in this chapter: EMR Serverless / AWS Glue ETL / Managed Flink / Lambda / MWAA / Step Functions.


Compute Engine Landscape

There are many options for “computing things” in the data lake. Here is a classification:

CategoryServiceBest For
SQL Engine (lightweight)Athena CTAS / INSERTTransformations expressible in simple SQL
Spark on ServerlessAWS Glue ETL / EMR ServerlessMedium-to-heavy batch processing
Spark on EC2EMR on EC2Highly customized / extreme Spot savings
Stream ProcessingAmazon Managed Service for Apache FlinkReal-time features / real-time fraud detection
Lightweight FunctionsAWS LambdaShort tasks / small data / enrichment

How to choose:

  • A single SQL statement can do it all → Athena CTAS (cheapest)
  • Need Python UDFs / complex transformations / large datasets → EMR Serverless first (typically cheaper than Glue, better performance)
  • Team already familiar with Glue Studio visual designer / medium data volumes → Glue ETL Job
  • Need Spot instances / custom Hadoop components → EMR on EC2
  • Real-time → Managed Flink

AWS Glue ETL

The Two Roles of Glue

Note that Glue is an umbrella product containing several sub-services:

  • Glue Data Catalog (covered in Part 4) — metadata
  • Glue ETL — Spark job engine
  • Glue Crawler — automatic table discovery
  • Glue Studio — visual drag-and-drop ETL
  • Glue DataBrew — data cleansing UI
  • Glue Schema Registry — schema management

This section focuses on Glue ETL.

Glue ETL = Serverless Spark

At its core: managed Apache Spark + Python (PySpark). You write the Spark Job code, and AWS spins up a Spark cluster, runs it, and tears it down.

# A typical Glue Job (PySpark)
from pyspark.sql import SparkSession
from awsglue.context import GlueContext

glueContext = GlueContext(SparkSession.builder.getOrCreate())

# Read from an Iceberg table
df = glueContext.create_data_frame.from_catalog(
    database="poc_social_layla",
    table_name="ods_event"
).filter("dt = '2026-05-10'")

# Transform
df_clean = df.dropDuplicates(['event_id'])

# Write to DWD Iceberg table
df_clean.writeTo("poc_social_layla.dwd_user_action").append()

Pricing: DPU x Time

DPU = Data Processing Unit = 4 vCPU + 16 GB memory.

Pricing:

  • Glue ETL: $0.44 per DPU-hour (1-minute minimum)
  • Glue Flex (low priority, 35% cheaper): $0.29 per DPU-hour
  • Streaming Job: $0.44 per DPU-hour

Real-world cost example: 30 GB of data per day, 10 DPU x 10 minutes = 1.67 DPU-hours x $0.44 ~ $0.73/day ~ $22/month.

Glue 5.0 (2024-2025 GA) Improvements

Glue 5.0 upgrades Spark to 3.5, natively integrates the latest versions of Iceberg / Delta / Hudi, and introduces:

  • Automatic Iceberg compaction / snapshot expiration (S3 Tables / Glue Catalog managed) — no need to write your own OPTIMIZE jobs
  • Startup time reduced from ~1 minute to ~30 seconds
  • Lake Formation row-/column-level permissions enforced inside Spark

Remaining Pain Points of Glue

  • Startup is still slower compared to open-source Spark (not cost-effective for short jobs)
  • At equivalent scale, EMR Serverless is typically still 15-25% cheaper

The rule of thumb remains: heavy jobs → EMR Serverless / medium jobs → Glue 5.0 / lightweight SQL → Athena CTAS.

Official documentation:


EMR / EMR Serverless

EMR Version Evolution

VersionDescription
EMR on EC2Classic: spin up an EC2 cluster running Hadoop / Spark / Hive. Most flexible, cheapest (Spot), but requires ops
EMR on EKSRun Spark on Kubernetes
EMR Serverless (2022+)Fully serverless, AWS manages everything, billed per vCPU + GB-hour

Our recommendation: EMR Serverless.

EMR Serverless Execution Model

Your Spark Job (PySpark / Scala JAR)
    |
    v
 EMR Serverless Application
   (Apache Spark / Hive — your choice)
    |
    v
 AWS automatically spins up workers (elastic on demand)
    |
    v
 Job completes, results land in S3, resources released

Key characteristics:

  • Sub-second startup (with pre-initialized capacity mode)
  • Billed by exact resource usage (vCPU-hour + Memory-GB-hour + storage)
  • No minimum spend

EMR Serverless vs Glue ETL

EMR ServerlessGlue ETL
PriceSlightly lowerSlightly higher
Startup timeSeconds (pre-init) / tens of seconds1-2 minutes
Spark versionCloser to upstream open sourceAWS fork, slightly behind
Ease of useMediumHigh (Glue Studio visual)
RecommendationTop choice for heavy batchTeam already using it / simple jobs

Official documentation: EMR Serverless: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/


Apache Flink is an open-source stream processing engine. Compared to Spark Streaming, it excels at true stream processing:

Spark StreamingFlink
ModelMicro-batch (second-level)True streaming (event-level, millisecond-level)
State managementWeakStrong (RocksDB state backend, Savepoints)
Exactly-onceComplexNative support
Time semanticsAverageExcellent (event time + watermarks)

For complex real-time computations (e.g., “3 consecutive failed logins within the last 5 minutes for a user”), Flink provides a far better experience than Spark Streaming.

Old name: Kinesis Data Analytics for Apache Flink
New name (renamed Aug 2023): Amazon Managed Service for Apache Flink

Key features:

  • Managed Flink cluster
  • Billed per KPU (Kinesis Processing Unit = 1 vCPU + 4 GB)
  • Integrates with MSK / KDS / Firehose / DynamoDB / S3

Role in Customer Scenarios: Real-Time Features

MSK Topic: events
    |
    v
Managed Flink:
  - Group by user_id
  - Maintain "last 5 clicks" state (RocksDB)
  - On each new event → update state → write to DynamoDB
    |
    v
DynamoDB user_realtime_features:
  user_id=12345 → { last_5_clicks: [item_a, item_b, ...] }
    |
    v
Recommendation service point-queries at inference time (milliseconds)

Practical tips:

  • Size KPU count based on throughput (~10,000 events/sec per KPU)
  • Enable RocksDB backend when state volume is large
  • Set checkpoint interval to 1-5 minutes for fault recovery

Official documentation: https://docs.aws.amazon.com/managed-flink/


AWS Lambda

What Is Lambda

Serverless function compute. You provide code; events trigger execution:

  • Maximum duration: 15 minutes per invocation
  • Memory: 128 MB to 10 GB
  • Billed per millisecond

Role in the Data Pipeline

PositionUsage
Behind API GatewayAuthentication + event enrichment
MSK / KDS consumerSimple real-time processing
S3 PUT triggerProcess files immediately on landing
EventBridge / CronLightweight periodic tasks
Glue / EMR triggerKick off downstream jobs

What Lambda Is Not Good At

  • Long tasks (> 15 minutes) → Use ECS / Step Functions
  • Large memory (> 10 GB) → Use EMR
  • Persistent state → Use DynamoDB / RDS

Pipeline Orchestration: MWAA vs Step Functions

A data warehouse ETL is never just one job. It is dozens of jobs chained together by dependency into a DAG (Directed Acyclic Graph):

Orchestration DAG

Who manages this DAG? Two contenders.

MWAA (Managed Workflows for Apache Airflow)

A managed version of the open-source Apache Airflow (originally created at Airbnb).

How Airflow works: you write Python to describe a DAG.

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.athena import AthenaOperator
from datetime import datetime

with DAG('daily_warehouse', start_date=datetime(2026,5,1), schedule='0 2 * * *') as dag:
    
    dwd_clean = GlueJobOperator(
        task_id='dwd_clean',
        job_name='dwd_user_action_clean',
    )
    
    dws_aggr = AthenaOperator(
        task_id='dws_aggr',
        query="INSERT INTO dws_user_daily SELECT ... FROM dwd_user_action ...",
        workgroup='poc-social-layla',
    )
    
    ads_features = GlueJobOperator(
        task_id='ads_features',
        job_name='ads_user_features',
    )
    
    dwd_clean >> dws_aggr >> ads_features

Strengths:

  • Powerful DAG expressiveness (conditional branches, dynamic generation, SubDAGs)
  • 200+ Operators (including Glue / EMR / Athena / SageMaker)
  • Intuitive Web UI (view DAG status, retry, backfill)
  • Retry, SLA, and alerting all built-in

Pain points:

  • MWAA has a high floor: an mw1.small base capacity costs ~$300/month; with 1-2 elastic workers it lands at $300-400/month
  • Airflow has a learning curve (DAG scheduling concepts, execution_date timezone pitfalls)
  • Upgrading Airflow versions is painful

AWS Step Functions

Fully AWS-native, billed per state transition ($25 per million state transitions).

Workflows are described in JSON (ASL = Amazon States Language):

{
  "StartAt": "DWD",
  "States": {
    "DWD": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "dwd_user_action_clean"},
      "Next": "DWS"
    },
    "DWS": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {"QueryString": "INSERT INTO ..."},
      "Next": "ADS"
    },
    "ADS": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "ads_user_features"},
      "End": true
    }
  }
}

Strengths:

  • Fully serverless, pay-per-use
  • Integrates with 200+ AWS services (call directly, no Lambda wrapper needed)
  • Visual DAG (view each step’s status in real-time during execution)
  • Supports error retry, parallel branches, Map State

Pain points:

  • DAG expressiveness is less flexible than Airflow (dynamic DAGs, complex conditional branches are weaker)
  • ASL JSON becomes hard to maintain when large (recommend generating with CDK / Terraform)

How to Choose

ScenarioPick
Warehouse batch ETL, 10+ tasksMWAA (mature Airflow ecosystem)
Warehouse batch ETL, simple 5-20 tasksStep Functions (cheap, serverless)
Cross-team complex scheduling, need Web UIMWAA
Single business line, no ops desiredStep Functions
Hybrid: MWAA as master scheduler + Step Functions for sub-workflowsBoth

Official documentation:


Customer Scenario: Orchestration Example

Daily 02:00 (UTC+8): MWAA DAG kicks off
  +-- 02:00 ods_user_full_load_check (prerequisite: DMS / Zero-ETL completed for the day)
  +-- 02:30 dwd_user_action_clean (Glue Job)
  +-- 02:30 dwd_post_enrich       (Glue Job)
  +-- 03:00 dws_user_daily        (Athena CTAS)
  +-- 03:30 ads_user_features     (EMR Serverless)
  +-- 03:30 ads_sample_follow     (EMR Serverless)
  +-- 04:00 sync_to_dynamodb      (Glue Job writes to DynamoDB)
  +-- 04:30 train_recall_model    (SageMaker Training Job)
  +-- 04:30 train_rank_model      (SageMaker Training Job)
  +-- 05:30 deploy_endpoint       (Lambda calls SageMaker API)
  +-- 06:00 dq_check_report       (Slack alert / email)

Each step:
- Auto-retry 2 times on failure
- Still failing → page on-call engineer
- Overall DAG SLA: 7 hours

Chapter Summary

ServiceOne-Liner
Athena CTASCheapest option for simple SQL transformations
Glue ETLManaged Spark with Studio visual editor
EMR ServerlessHeavy batch processing — cheaper and faster
Managed FlinkReal-time stream processing
LambdaShort tasks / enrichment / triggers
MWAAComplex DAG scheduling with the Airflow ecosystem
Step FunctionsSimple DAGs, serverless, AWS-native

This concludes the compute and orchestration layer. In the next chapter, we stitch together the services from Parts 3-5 and draw the end-to-end data pipeline for our customer scenario.