Amazon SageMaker Pipelines

ML workflow orchestration that automates the entire machine learning lifecycle — from data processing through model deployment — as a reproducible, auditable directed acyclic graph (DAG).


🔗 What is SageMaker Pipelines?

Amazon SageMaker Pipelines is a workflow orchestration service purpose-built for machine learning. It lets you define, automate, and manage end-to-end ML workflows as a Directed Acyclic Graph (DAG), where each node is a step (processing, training, evaluation, deployment) and edges define data dependencies.

Think of it as a CI/CD pipeline for ML models: just as software teams use Jenkins or GitHub Actions to automate build→test→deploy, ML teams use Pipelines to automate data-prep→train→evaluate→register→deploy.

💡
Why it matters: Without orchestration, ML teams manually run notebooks, copy S3 paths between steps, and lose track of which model was trained on which data. Pipelines makes every run reproducible, auditable, and automated.

πŸ—οΈ Core Concepts

🔀

DAG (Directed Acyclic Graph)

Steps form a graph with directed edges. No cycles allowed — data flows forward. Steps without dependencies run in parallel automatically.

📦

Steps

Atomic units of work: Processing, Training, Tuning, Model Registration, Condition, Lambda, QualityCheck, and more. 18+ step types available.

🔗

Data Dependencies

Pass one step's outputs as the next step's inputs via property references (resolved as JsonPath at runtime). Pipelines uses these references to build the execution graph.

⚙️

Parameters

Runtime variables (instance type, data path, threshold) that let you reuse the same pipeline definition with different configurations.

📋

Model Registry

Version and catalog trained models with metadata, metrics, and approval status. Enables governance and controlled deployment.

🔄

Executions

Each pipeline run is an execution with its own status, artifacts, and lineage. Compare executions side-by-side in Studio.

πŸ—ΊοΈ Course Coverage

SageMaker Pipelines appears across 6 modules and is the primary service in Lab 6 (90 minutes). It is the backbone of the MLOps story in this course.

🎯
Module connections: Module 4 (export to Pipelines), Module 6 (training pipelines), Module 8 (deployment steps), Module 9 (security in pipelines), Module 10 (full MLOps orchestration), Module 11 (monitoring triggers retraining pipeline).
πŸ‹οΈ Module 6: Training πŸš€ Module 8: Deployment βš™οΈ Module 10: MLOps πŸ“ˆ Module 11: Monitoring

📦 Pipeline Step Types

SageMaker Pipelines supports 18+ step types. Each step maps to a SageMaker API action. Here are the most important ones for ML workflows:

| Step Type | What It Does | HCM Example |
| --- | --- | --- |
| Processing | Run data processing scripts (Spark, SKLearn, custom) | Clean payroll data, compute features, split train/test |
| Training | Train a model using any algorithm or framework | Train XGBoost attrition model on employee data |
| Tuning | Run hyperparameter optimization (multiple training jobs) | Find optimal learning_rate and max_depth for fraud model |
| Model | Create or register a model in the registry | Register approved attrition model v2.3 with metrics |
| Condition | Branch logic (if/else) based on step outputs | Only register model if AUC > 0.85 |
| QualityCheck | Run data or model quality checks against baselines | Verify no data drift in incoming payroll features |
| ClarifyCheck | Detect bias and explain model predictions | Check for gender/age bias in salary prediction model |
| Transform | Run batch inference on a dataset | Score all employees for attrition risk monthly |
| Lambda | Run custom serverless code (notifications, triggers) | Send Slack alert when model is registered |
| Fail | Conditionally stop pipeline with error message | Halt if data quality check fails (compliance gate) |

🔗 How Steps Connect

Steps connect through data dependencies (output of one step feeds into the next) or custom dependencies (explicit ordering without data passing).

📊

Data Dependency

step_train = TrainingStep(name="TrainModel", inputs={"train": TrainingInput(s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri)})
The training step automatically waits for processing to finish and uses its output path.

⏳

Custom Dependency

step_deploy.add_depends_on([step_evaluate])
Deploy step waits for evaluation even without direct data flow. Useful for approval gates.

⚡

Parallel Execution

Steps with no dependencies run simultaneously. Control concurrency with ParallelismConfiguration(max_parallel_execution_steps=5).

AnyCompany Pipeline Pattern

Fraud Detection Monthly Retrain: Processing (clean new payroll data) → Training (XGBoost) + ClarifyCheck (bias) run in parallel → Condition (recall > 0.95?) → Register Model → Lambda (notify compliance team)

🔀 Interactive DAG Visualizer

Click any node to explore that pipeline step. Or hit Auto-play to watch data flow through the entire pipeline with animated particles.

πŸŽ™οΈ Click any node to explore, or press Auto-play to watch the AnyCompany Attrition Pipeline execute.
IF AUC > 0.85 πŸ“₯ Load Data S3 β†’ Processing βš™οΈ Process Clean + Split πŸ‹οΈ Train XGBoost πŸ“Š Evaluate Metrics πŸ”€ Condition Quality Gate πŸ“‹ Register Model Registry πŸš€ Deploy Endpoint

πŸ’‘ Click any node to learn about that step β€” or use Auto-play to watch data flow through the pipeline

🔄 Pipeline Parameters

Parameters make pipelines reusable. Define them once, override at execution time:

# Define parameters at pipeline creation
from sagemaker.workflow.parameters import ParameterFloat, ParameterString

instance_type = ParameterString(name="InstanceType", default_value="ml.m5.xlarge")
train_data = ParameterString(name="TrainData", default_value="s3://bucket/train/")
approval_status = ParameterString(name="ApprovalStatus", default_value="PendingManualApproval")
auc_threshold = ParameterFloat(name="AucThreshold", default_value=0.85)

# Override at execution time (keys match the parameter names)
pipeline.start(parameters={
  "InstanceType": "ml.m5.4xlarge",
  "AucThreshold": 0.90
})
AnyCompany Parameter Strategy

Same pipeline, different configs: Use instance_type=ml.m5.xlarge for dev runs, ml.m5.4xlarge for production. Use auc_threshold=0.80 for experimentation, 0.95 for production fraud model quality gate.

🐍 Two Ways to Build Pipelines

SageMaker offers both a code-first SDK approach and a visual drag-and-drop designer. Choose based on your team's workflow:

💻

Python SDK

Best for: ML Engineers who want version-controlled, parameterized pipelines in Git.
How: Define steps in Python, chain with data dependencies, call pipeline.create().
Lab 6 uses this approach.

🎨

Pipeline Designer (Visual)

Best for: Rapid prototyping, stakeholder demos, teams new to MLOps.
How: Drag step nodes onto canvas in Studio, connect edges, configure in sidebar.
New: Execute Code step for custom Python.

💻 SDK Pipeline Structure

A pipeline definition in Python follows this pattern:

# 1. Create a PipelineSession
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
session = PipelineSession()

# 2. Define steps
step_process = ProcessingStep(name="PrepareData", ...)
step_train = TrainingStep(name="TrainModel", ...)
step_eval = ProcessingStep(name="EvaluateModel", ...)
step_register = ModelStep(name="RegisterModel", ...)
step_cond = ConditionStep(name="CheckMetrics", if_steps=[step_register], ...)

# 3. Assemble pipeline (step_register runs inside the condition branch)
pipeline = Pipeline(
  name="AnyCompany-Attrition-Pipeline",
  parameters=[instance_type, train_data, auc_threshold],
  steps=[step_process, step_train, step_eval, step_cond]
)

# 4. Create (or update) and execute
pipeline.upsert(role_arn=role)
execution = pipeline.start()
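After pipeline.start(), the returned execution object can be polled from code as well as from Studio. execution.wait() and execution.list_steps() are real SDK calls; the summarizing helper and the sample payload below are illustrative assumptions:

```python
def summarize_steps(steps):
    """Condense list_steps() results into a {step name: status} dict."""
    return {s["StepName"]: s["StepStatus"] for s in steps}

# With a live execution from pipeline.start():
#   execution.wait()                          # block until a terminal state
#   summarize_steps(execution.list_steps())   # per-step statuses

# Illustrative shape of what list_steps() returns:
sample = [
    {"StepName": "PrepareData", "StepStatus": "Succeeded"},
    {"StepName": "TrainModel", "StepStatus": "Executing"},
]
print(summarize_steps(sample))
```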

🎯 The @step Decorator

The newest way to create pipeline steps — convert any Python function into a pipeline step with a single decorator:

import pandas as pd
from sagemaker.workflow.function_step import step

@step(name="preprocess", instance_type="ml.m5.xlarge")
def preprocess(input_path: str) -> str:
  # Your existing preprocessing code — unchanged
  df = pd.read_csv(input_path)
  df = df.dropna().drop_duplicates()
  output_path = "/opt/ml/processing/output/clean.csv"
  df.to_csv(output_path, index=False)
  return output_path
⚡
Lift-and-shift: The @step decorator lets you take existing notebook code and convert it to a pipeline step without rewriting. Test locally first, then add the decorator to orchestrate.

📊 Monitoring Executions in Studio

After starting a pipeline, Studio provides real-time visibility:

🔀

DAG View

Visual graph showing step status (running/succeeded/failed) with color coding. Click any node to see logs.

📈

Execution History

Compare multiple runs side-by-side. See which parameters changed and how metrics improved.

🔍

Lineage Tracking

Trace any model back to its training data, processing code, and hyperparameters. Full audit trail.

🔄

Retry Policy

Configure automatic retries for transient failures (capacity errors, throttling). Set MaxAttempts per step.

πŸ—οΈ Pipeline Builder β€” Design Your Own

Select an AnyCompany HCM scenario to see a complete pipeline configuration with step types, parameters, quality gates, and deployment strategy.

🔍

Payroll Fraud Detection

Monthly retrain on new payroll transactions. High recall required. Auto-deploy if quality gate passes.

🚪

Employee Attrition

Quarterly retrain on HR data. Bias check mandatory. Manual approval before production.

💬

AnyCompany Assist (LLM)

Continuous fine-tuning on support tickets. A/B test before full rollout. Human-in-the-loop evaluation.


🏒 Why Pipelines Matter for AnyCompany

AnyCompany processes payroll for millions of workers across multiple countries. ML models must be retrained regularly as workforce patterns change. Without orchestration, this is manual, error-prone, and unauditable.

🔄

Reproducibility

Every model can be traced back to exact data, code, and parameters. Critical for compliance audits across 140+ jurisdictions.

⏰

Automation

Monthly payroll data triggers retraining automatically. No manual notebook runs. No forgotten steps.

🛡️

Quality Gates

Condition steps enforce minimum metrics before deployment. A fraud model with recall < 95% never reaches production.

👥

Governance

Model Registry tracks versions with approval workflows. Compliance team reviews before production deployment.

🔄 Pipeline Triggers in Production

Pipelines can be triggered by multiple events in a production HCM system:

| Trigger | Mechanism | HCM Scenario |
| --- | --- | --- |
| Schedule | EventBridge cron rule | Monthly fraud model retrain after payroll cycle closes |
| Data Arrival | S3 event → Lambda → pipeline.start() | New country onboarded → retrain multi-country model |
| Drift Detected | Model Monitor → CloudWatch → Lambda | Feature drift in attrition model triggers retraining |
| Manual | Studio UI or API call | Data science team experiments with new features |
| CI/CD | CodePipeline / GitHub Actions | New preprocessing code merged → validate with pipeline run |

🔑 Key Takeaways

1️⃣

DAG = Dependency Graph

Define what depends on what. Pipelines figures out execution order and parallelism automatically.

2️⃣

Parameters = Reusability

One pipeline definition serves dev, staging, and production. Override parameters at runtime.

3️⃣

Condition = Quality Gate

Never deploy a bad model. Condition steps enforce metric thresholds before registration.

4️⃣

Registry = Governance

Every model version is cataloged with metrics, lineage, and approval status. Audit-ready.