AWS Glue Local Development Workflow¶

Complete guide for developing and testing AWS Glue 5.0 jobs locally using Docker, Jupyter, Neovim, and a proper test pyramid.

Overview¶

This workflow enables 100% local development of AWS Glue jobs without AWS Glue Interactive Sessions costs, while maintaining a professional testing strategy from unit tests to end-to-end validation.

Key Components:

Jupyter in Glue Docker - Interactive PySpark development locally
Neovim + Molten - Edit notebooks in Neovim with your dotfiles
Three-level testing - Unit → Integration → E2E
Testable code structure - Separate business logic from Glue boilerplate

Architecture¶

Development Flow:
1. Interactive development (Neovim + local Jupyter in Glue Docker)
   ↓
2. Extract functions to testable modules
   ↓
3. Unit tests (pure PySpark, no Glue - fast)
   ↓
4. Integration tests (DynamicFrame, in Glue Docker - medium)
   ↓
5. E2E tests (real AWS Glue jobs - slow but thorough)

Setup¶

Prerequisites¶

# Install Jupyter client (for kernel management)
pip install jupyter-client pynvim

# Ensure molten-nvim is installed (see Neovim config)
# Plugin file: platforms/common/.config/nvim/lua/plugins/molten.lua

Project Structure¶

Create this structure for your Glue project:

glue-project/
├── docker-compose.yml           # Glue container with Jupyter
├── Makefile                     # Helper commands
│
├── glue_jobs/                   # Job scripts
│   ├── lib/                     # Testable modules
│   │   ├── __init__.py
│   │   ├── transformations.py  # Pure transformation logic
│   │   ├── validators.py       # Data quality checks
│   │   └── utils.py            # Helper functions
│   │
│   ├── customer_etl.py         # Thin Glue job wrapper
│   └── product_etl.py
│
├── tests/
│   ├── conftest.py             # Pytest fixtures
│   ├── unit/                   # Fast unit tests (no Glue)
│   │   ├── test_transformations.py
│   │   └── test_validators.py
│   ├── integration/            # Integration tests (needs Glue Docker)
│   │   └── test_dynamicframe_ops.py
│   └── e2e/                    # E2E tests (real Glue)
│       └── test_customer_etl.py
│
├── notebooks/                  # Interactive development
│   └── customer_analysis.py   # Edit with Neovim + molten
│
├── test_data/                  # Local test data
│   ├── input/
│   └── output/
│
└── pytest.ini

Docker Compose Configuration¶

docker-compose.yml:

version: '3.8'

services:
  glue-jupyter:
    image: public.ecr.aws/glue/aws-glue-libs:glue_libs_5.0.0_image_01
    container_name: glue-jupyter
    ports:
      - "8888:8888"   # Jupyter
      - "4040:4040"   # Spark UI
      - "18080:18080" # Spark History Server
    volumes:
      # Project files
      - ./glue_jobs:/home/hadoop/workspace/glue_jobs
      - ./tests:/home/hadoop/workspace/tests
      - ./notebooks:/home/hadoop/workspace/notebooks
      - ./test_data:/home/hadoop/workspace/test_data

      # AWS credentials (for S3 access if needed)
      - ~/.aws:/home/hadoop/.aws:ro
    working_dir: /home/hadoop/workspace
    environment:
      - DISABLE_SSL=true
      - AWS_REGION=us-east-1
      - JUPYTER_ENABLE_LAB=yes
      - PYTHONPATH=/home/hadoop/workspace
    user: root
    command: >
      bash -c "
        pip3 install --upgrade pip &&
        pip3 install jupyter jupyterlab ipykernel pytest pytest-mock boto3 awswrangler &&

        jupyter notebook --generate-config &&

        echo \"c.NotebookApp.ip = '0.0.0.0'\" >> /root/.jupyter/jupyter_notebook_config.py &&
        echo \"c.NotebookApp.allow_root = True\" >> /root/.jupyter/jupyter_notebook_config.py &&
        echo \"c.NotebookApp.token = ''\" >> /root/.jupyter/jupyter_notebook_config.py &&
        echo \"c.NotebookApp.password = ''\" >> /root/.jupyter/jupyter_notebook_config.py &&

        echo 'Starting Jupyter Lab on http://localhost:8888' &&
        jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
      "

Makefile Helper Commands¶

Makefile:

.PHONY: jupyter-start jupyter-logs jupyter-stop shell pyspark test-unit test-integration test-e2e test-all

# Start Jupyter in Glue container
jupyter-start:
 docker-compose up -d
 @echo "🚀 Jupyter running at http://localhost:8888"
 @echo "📊 Spark UI at http://localhost:4040"

# View Jupyter logs
jupyter-logs:
 docker-compose logs -f glue-jupyter

# Stop Jupyter
jupyter-stop:
 docker-compose down

# Get a shell in the container
shell:
 docker-compose exec glue-jupyter bash

# Start interactive PySpark shell
pyspark:
 docker-compose exec glue-jupyter pyspark

# Fast unit tests (no Docker needed)
test-unit:
 pytest tests/unit/ -v

# Integration tests (DynamicFrame, in Docker)
test-integration:
 docker-compose exec glue-jupyter pytest tests/integration/ -v

# E2E tests (real Glue)
test-e2e:
 pytest tests/e2e/ -v --log-cli-level=INFO

# All tests
test-all: test-unit test-integration test-e2e

Code Structure¶

Testable Transformation Module¶

glue_jobs/lib/transformations.py:

"""
Transformation functions for customer data.
Functions work with both DataFrame and DynamicFrame for flexibility.
"""

from pyspark.sql import DataFrame
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext


def filter_active_customers_df(df: DataFrame) -> DataFrame:
    """
    Filter active customers (pure PySpark - easily testable).

    Args:
        df: Input DataFrame with customer data

    Returns:
        DataFrame with only active customers
    """
    return df.filter(df.status == "active")


def add_customer_tier_df(df: DataFrame) -> DataFrame:
    """
    Add customer tier based on total_spent (pure PySpark).

    Testable without Glue context!
    """
    from pyspark.sql.functions import when, col

    return df.withColumn(
        "customer_tier",
        when(col("total_spent") >= 10000, "platinum")
        .when(col("total_spent") >= 5000, "gold")
        .when(col("total_spent") >= 1000, "silver")
        .otherwise("bronze")
    )


def transform_customer_data_dyf(
    dyf: DynamicFrame,
    glue_context: GlueContext
) -> DynamicFrame:
    """
    Transform customer data using DynamicFrame (Glue-specific).

    This uses DynamicFrame but still testable in Glue Docker.
    """
    from awsglue.transforms import ApplyMapping, Filter

    # Filter using DynamicFrame
    filtered = Filter.apply(
        frame=dyf,
        f=lambda x: x["status"] == "active"
    )

    # Apply schema mapping
    mapped = ApplyMapping.apply(
        frame=filtered,
        mappings=[
            ("customer_id", "string", "customer_id", "string"),
            ("name", "string", "full_name", "string"),
            ("email", "string", "email", "string"),
            ("total_spent", "double", "total_spent", "double"),
            ("created_at", "string", "created_date", "timestamp"),
        ]
    )

    # Convert to DataFrame, add tier, convert back
    df = mapped.toDF()
    df_with_tier = add_customer_tier_df(df)

    return DynamicFrame.fromDF(df_with_tier, glue_context, "with_tier")

Thin Glue Job Wrapper¶

glue_jobs/customer_etl.py:

"""
Customer ETL Glue Job.

Thin wrapper - all logic in lib/transformations.py for testing.
"""

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

from lib.transformations import transform_customer_data_dyf


def main():
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Load from Glue Catalog (not testable locally - that's OK)
    datasource = glueContext.create_dynamic_frame.from_catalog(
        database="customers_db",
        table_name="customers_raw"
    )

    # Transform (testable function!)
    transformed = transform_customer_data_dyf(datasource, glueContext)

    # Write to Glue Catalog (not testable locally - that's OK)
    glueContext.write_dynamic_frame.from_catalog(
        frame=transformed,
        database="customers_db",
        table_name="customers_processed"
    )

    job.commit()


if __name__ == "__main__":
    main()

Testing Strategy¶

Test Pyramid¶

        E2E Tests (Real Glue)
        • Full job execution
        • Real S3, Glue Catalog
        • Slow (2-3 min)
        • Run before PR
       /                    \
      /  Integration Tests   \
     /   (Glue Docker)         \
    /    • DynamicFrame ops     \
   /     • GlueContext needed    \
  /      • Medium (30 sec)        \
 /       • Run frequently          \
/___________________________________\
    Unit Tests (Pure PySpark)
    • Pure transformation logic
    • No Glue dependencies
    • Fast (seconds)
    • Run on every save

Level 1: Unit Tests (Pure PySpark)¶

tests/unit/test_transformations.py:

"""
Unit tests for transformation functions.
These run FAST (no Glue Docker needed).
"""

import pytest
from pyspark.sql import Row
from glue_jobs.lib.transformations import (
    filter_active_customers_df,
    add_customer_tier_df
)


def test_filter_active_customers(spark_session):
    """Test filtering active customers (pure PySpark)"""
    # Arrange
    data = [
        Row(customer_id="1", status="active", name="Alice"),
        Row(customer_id="2", status="inactive", name="Bob"),
        Row(customer_id="3", status="active", name="Charlie"),
    ]
    df = spark_session.createDataFrame(data)

    # Act
    result = filter_active_customers_df(df)

    # Assert
    assert result.count() == 2
    names = [row.name for row in result.collect()]
    assert "Alice" in names
    assert "Charlie" in names
    assert "Bob" not in names


def test_customer_tier_assignment(spark_session):
    """Test customer tier logic (pure PySpark)"""
    data = [
        Row(customer_id="1", total_spent=15000.0),  # platinum
        Row(customer_id="2", total_spent=7000.0),   # gold
        Row(customer_id="3", total_spent=2000.0),   # silver
        Row(customer_id="4", total_spent=500.0),    # bronze
    ]
    df = spark_session.createDataFrame(data)

    result = add_customer_tier_df(df)

    tiers = {row.customer_id: row.customer_tier for row in result.collect()}
    assert tiers["1"] == "platinum"
    assert tiers["2"] == "gold"
    assert tiers["3"] == "silver"
    assert tiers["4"] == "bronze"

Level 2: Integration Tests (DynamicFrame)¶

tests/integration/test_dynamicframe_ops.py:

"""
Integration tests for DynamicFrame transformations.
These need Glue Docker (has GlueContext libraries).
"""

import pytest
from pyspark.sql import Row
from awsglue.dynamicframe import DynamicFrame
from glue_jobs.lib.transformations import transform_customer_data_dyf


def test_transform_customer_data_with_dynamicframe(spark_session, glue_context):
    """Test full transformation with DynamicFrame"""
    # Arrange
    data = [
        Row(
            customer_id="1",
            name="Alice Smith",
            email="alice@example.com",
            status="active",
            total_spent=15000.0,
            created_at="2024-01-15"
        ),
        Row(
            customer_id="2",
            name="Bob Jones",
            email="bob@example.com",
            status="inactive",
            total_spent=5000.0,
            created_at="2024-02-20"
        ),
    ]
    df = spark_session.createDataFrame(data)
    input_dyf = DynamicFrame.fromDF(df, glue_context, "test_input")

    # Act
    result_dyf = transform_customer_data_dyf(input_dyf, glue_context)

    # Assert
    result_df = result_dyf.toDF()

    # Should only have active customer
    assert result_df.count() == 1

    # Check schema mapping worked
    assert "full_name" in result_df.columns
    assert "customer_tier" in result_df.columns

    # Check tier assignment
    row = result_df.first()
    assert row.full_name == "Alice Smith"
    assert row.customer_tier == "platinum"

Level 3: E2E Tests (Real Glue)¶

tests/e2e/test_customer_etl.py:

"""
E2E tests - run real Glue job with real data.
These are VALUABLE - catch integration issues.
"""

import pytest
import boto3
import awswrangler as wr


@pytest.fixture(scope="module")
def setup_test_data():
    """Load test data into real Glue catalog tables"""
    # Load CSV to S3, create Glue table
    yield
    # Cleanup


def test_customer_etl_end_to_end(setup_test_data):
    """E2E test: Run real Glue job, verify output"""
    glue_client = boto3.client('glue')

    # Run real Glue job
    response = glue_client.start_job_run(
        JobName='customer-etl-dev',
        Arguments={'--additional-python-modules': 'custom-lib==1.0.0'}
    )

    job_run_id = response['JobRunId']

    # Wait for completion
    waiter = glue_client.get_waiter('job_run_complete')
    waiter.wait(JobName='customer-etl-dev', RunId=job_run_id)

    # Verify output using awswrangler
    df = wr.athena.read_sql_query(
        sql="SELECT * FROM customers_db.customers_processed WHERE date = CURRENT_DATE",
        database="customers_db"
    )

    # Assertions
    assert len(df) > 0, "No output data"
    assert 'customer_tier' in df.columns
    assert df['customer_tier'].isin(['platinum', 'gold', 'silver', 'bronze']).all()

Pytest Configuration¶

tests/conftest.py:

"""Pytest fixtures for all test levels"""

import pytest
from pyspark.sql import SparkSession
from awsglue.context import GlueContext


@pytest.fixture(scope="session")
def spark_session():
    """Spark session for unit tests (no Glue)"""
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield spark
    spark.stop()


@pytest.fixture(scope="session")
def glue_context(spark_session):
    """GlueContext for integration tests (needs Glue Docker)"""
    from pyspark.context import SparkContext
    sc = spark_session.sparkContext
    glue_context = GlueContext(sc)
    return glue_context

pytest.ini:

[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts = -v --tb=short

Interactive Development¶

Using Neovim with Molten¶

Start Jupyter in Docker:

make jupyter-start

Create notebook file (notebooks/customer_analysis.py):

# %% [markdown]
# # Customer Data Transformation - Local Development

# %%
# Initialize Glue context (works locally!)
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

print("✅ GlueContext initialized locally!")

# %%
# Create sample data
from pyspark.sql import Row

data = [
    Row(customer_id="1", name="Alice", status="active", total_spent=15000.0),
    Row(customer_id="2", name="Bob", status="inactive", total_spent=5000.0),
]

df = spark.createDataFrame(data)
dyf = DynamicFrame.fromDF(df, glueContext, "customers")

print(f"Created DynamicFrame with {dyf.count()} records")
dyf.printSchema()

# %%
# Develop transformation interactively
def transform_customer_data(input_dyf: DynamicFrame) -> DynamicFrame:
    """Transform customer data"""
    from awsglue.transforms import Filter
    from pyspark.sql.functions import when, col

    # Filter active customers
    filtered = Filter.apply(
        frame=input_dyf,
        f=lambda x: x["status"] == "active"
    )

    # Add tier
    df = filtered.toDF()
    df_with_tier = df.withColumn(
        "customer_tier",
        when(col("total_spent") >= 10000, "platinum")
        .otherwise("gold")
    )

    return DynamicFrame.fromDF(df_with_tier, glueContext, "transformed")

# Test it!
result = transform_customer_data(dyf)
result.toDF().show()

In Neovim:

" Open notebook
:e notebooks/customer_analysis.py

" Initialize Molten kernel (connects to Docker Jupyter)
<leader>mi

" Run cell under cursor
<leader>ml   " Run line
<leader>mv   " Run visual selection

" Show output
<leader>mo

" Re-run cell
<leader>mr

Keybindings (from molten.lua config):

<leader>mi - Initialize kernel
<leader>ml - Evaluate line
<leader>mv - Evaluate visual selection
<leader>mr - Re-evaluate cell
<leader>mo - Show output
<leader>mh - Hide output

Alternative: Use Browser¶

# Open Jupyter Lab in browser
open http://localhost:8888

# Create notebook, develop interactively
# Full Glue libraries available!

Complete Development Cycle¶

Daily Workflow¶

# 1. Start Jupyter in Glue container
make jupyter-start

# 2. Interactive development in Neovim
nvim notebooks/customer_transform.py
# <leader>mi to initialize kernel
# Develop and test function interactively

# 3. Extract working function to module
# notebooks/customer_transform.py → glue_jobs/lib/transformations.py

# 4. Write unit tests (fast feedback)
nvim tests/unit/test_transformations.py
make test-unit  # Runs in seconds

# 5. Write integration tests
nvim tests/integration/test_dynamicframe_ops.py
make test-integration  # Runs in Docker

# 6. Update Glue job script
nvim glue_jobs/customer_etl.py

# 7. Final E2E validation
make test-e2e  # Runs real Glue job

# 8. Stop container
make jupyter-stop

Development to Production Path¶

1. Interactive dev (Neovim + local Jupyter)
   • Rapid iteration
   • Immediate feedback
   • Full Glue libraries
   ↓
2. Extract to modules (glue_jobs/lib/)
   • Testable functions
   • Reusable logic
   ↓
3. Unit tests (seconds)
   • Pure PySpark
   • No Glue dependencies
   ↓
4. Integration tests (30 sec)
   • DynamicFrame operations
   • Glue Docker environment
   ↓
5. E2E tests (2-3 min)
   • Real Glue job execution
   • Full validation
   ↓
6. Production deployment
   • Tested and validated
   • Confident deployment

Troubleshooting¶

Jupyter Won't Start¶

# Check logs
make jupyter-logs

# Verify container is running
docker ps | grep glue-jupyter

# Restart container
make jupyter-stop
make jupyter-start

Can't Connect from Neovim¶

# Verify Jupyter is accessible
curl http://localhost:8888

# Check available kernels
jupyter kernelspec list

# Verify kernel in Docker
docker-compose exec glue-jupyter jupyter kernelspec list

GlueContext Not Found¶

Make sure you're running code in the Docker container (via Molten connection to local Jupyter kernel, not a separate Python interpreter).

Import Errors in Tests¶

# Set PYTHONPATH in docker-compose.yml
environment:
  - PYTHONPATH=/home/hadoop/workspace

# Or in pytest
export PYTHONPATH=$PWD
pytest tests/unit/

Key Advantages¶

✅ 100% Local - No AWS Glue Interactive Session costs ✅ Full Glue Libraries - GlueContext, DynamicFrame, everything works ✅ Neovim Workflow - Edit with your dotfiles ✅ Fast Iteration - No network latency ✅ Real S3 Testing - Mount AWS credentials for dev S3 buckets ✅ Proper Test Pyramid - Unit → Integration → E2E ✅ Production Parity - Glue 5.0 Docker = production Glue 5.0