Amazon SageMaker Pipelines

ML workflow orchestration that automates the entire machine learning lifecycle — from data processing through model deployment — as a reproducible, auditable directed acyclic graph (DAG).


🔗 What is SageMaker Pipelines?

Amazon SageMaker Pipelines is a workflow orchestration service purpose-built for machine learning. It lets you define, automate, and manage end-to-end ML workflows as a Directed Acyclic Graph (DAG), where each node is a step (processing, training, evaluation, deployment) and edges define data dependencies.

Think of it as a CI/CD pipeline for ML models: just as software teams use Jenkins or GitHub Actions to automate build→test→deploy, ML teams use Pipelines to automate data-prep→train→evaluate→register→deploy.

💡
Why it matters: Without orchestration, ML teams manually run notebooks, copy S3 paths between steps, and lose track of which model was trained on which data. Pipelines makes every run reproducible, auditable, and automated.

πŸ—οΈ Core Concepts

🔀

DAG (Directed Acyclic Graph)

Steps form a graph with directed edges. No cycles allowed — data flows forward. Steps without dependencies run in parallel automatically.

📦

Steps

Atomic units of work: Processing, Training, Tuning, Model Registration, Condition, Lambda, QualityCheck, and more. 18+ step types available.

🔗

Data Dependencies

Pass one step's outputs as the next step's inputs via property references (resolved as JsonPath at runtime). Pipelines uses these references to build the execution graph.

⚙️

Parameters

Runtime variables (instance type, data path, threshold) that let you reuse the same pipeline definition with different configurations.

📋

Model Registry

Version and catalog trained models with metadata, metrics, and approval status. Enables governance and controlled deployment.

🔄

Executions

Each pipeline run is an execution with its own status, artifacts, and lineage. Compare executions side-by-side in Studio.

πŸ—ΊοΈ Course Coverage

SageMaker Pipelines appears across 6 modules and is the primary service in Lab 6 (90 minutes). It is the backbone of the MLOps story in this course.

🎯
Module connections: Module 4 (export to Pipelines), Module 6 (training pipelines), Module 8 (deployment steps), Module 9 (security in pipelines), Module 10 (full MLOps orchestration), Module 11 (monitoring triggers retraining pipeline).
πŸ‹οΈ Module 6: Training πŸš€ Module 8: Deployment βš™οΈ Module 10: MLOps πŸ“ˆ Module 11: Monitoring

📦 Pipeline Step Types

SageMaker Pipelines supports 18+ step types. Each step maps to a SageMaker API action. Here are the most important ones for ML workflows:

| Step Type | What It Does | HCM Example |
| --- | --- | --- |
| Processing | Run data processing scripts (Spark, SKLearn, custom) | Clean payroll data, compute features, split train/test |
| Training | Train a model using any algorithm or framework | Train XGBoost attrition model on employee data |
| Tuning | Run hyperparameter optimization (multiple training jobs) | Find optimal learning_rate and max_depth for fraud model |
| Model | Create or register a model in the registry | Register approved attrition model v2.3 with metrics |
| Condition | Branch logic (if/else) based on step outputs | Only register model if AUC > 0.85 |
| QualityCheck | Run data or model quality checks against baselines | Verify no data drift in incoming payroll features |
| ClarifyCheck | Detect bias and explain model predictions | Check for gender/age bias in salary prediction model |
| Transform | Run batch inference on a dataset | Score all employees for attrition risk monthly |
| Lambda | Run custom serverless code (notifications, triggers) | Send Slack alert when model is registered |
| Fail | Conditionally stop pipeline with error message | Halt if data quality check fails (compliance gate) |

🔗 How Steps Connect

Steps connect through data dependencies (output of one step feeds into the next) or custom dependencies (explicit ordering without data passing).

📊

Data Dependency

step_train = TrainingStep(name="TrainModel", inputs={"train": TrainingInput(s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri)})
The training step automatically waits for processing to finish and uses its output path.

⏳

Custom Dependency

step_deploy.add_depends_on([step_evaluate])
Deploy step waits for evaluation even without direct data flow. Useful for approval gates.

⚡

Parallel Execution

Steps with no dependencies run simultaneously. Control concurrency with ParallelismConfiguration(max_parallel_execution_steps=5).

AnyCompany Pipeline Pattern

Fraud Detection Monthly Retrain: Processing (clean new payroll data) → Training (XGBoost) + ClarifyCheck (bias) run in parallel → Condition (recall > 0.95?) → Register Model → Lambda (notify compliance team)

🔀 Interactive DAG Visualizer

Click any node to explore that pipeline step. Or hit Auto-play to watch data flow through the entire pipeline with animated particles.

πŸŽ™οΈ Click any node to explore, or press Auto-play to watch the AnyCompany Attrition Pipeline execute.
IF AUC > 0.85 πŸ“₯ Load Data S3 β†’ Processing βš™οΈ Process Clean + Split πŸ‹οΈ Train XGBoost πŸ“Š Evaluate Metrics πŸ”€ Condition Quality Gate πŸ“‹ Register Model Registry πŸš€ Deploy Endpoint

πŸ’‘ Click any node to learn about that step β€” or use Auto-play to watch data flow through the pipeline

🔄 Pipeline Parameters

Parameters make pipelines reusable. Define them once, override at execution time:

# Define parameters at pipeline creation
from sagemaker.workflow.parameters import ParameterFloat, ParameterString

instance_type = ParameterString(name="InstanceType", default_value="ml.m5.xlarge")
train_data = ParameterString(name="TrainData", default_value="s3://bucket/train/")
approval_status = ParameterString(name="ApprovalStatus", default_value="PendingManualApproval")
auc_threshold = ParameterFloat(name="AucThreshold", default_value=0.85)

# Override at execution time (keys match the parameter names)
pipeline.start(parameters={
  "InstanceType": "ml.m5.4xlarge",
  "AucThreshold": 0.90
})
AnyCompany Parameter Strategy

Same pipeline, different configs: Use instance_type=ml.m5.xlarge for dev runs, ml.m5.4xlarge for production. Use auc_threshold=0.80 for experimentation, 0.95 for production fraud model quality gate.

🐍 Two Ways to Build Pipelines

SageMaker offers both a code-first SDK approach and a visual drag-and-drop designer. Choose based on your team's workflow:

💻

Python SDK

Best for: ML Engineers who want version-controlled, parameterized pipelines in Git.
How: Define steps in Python, chain with data dependencies, call pipeline.create().
Lab 6 uses this approach.

🎨

Pipeline Designer (Visual)

Best for: Rapid prototyping, stakeholder demos, teams new to MLOps.
How: Drag step nodes onto canvas in Studio, connect edges, configure in sidebar.
New: Execute Code step for custom Python.

💻 SDK Pipeline Structure

A pipeline definition in Python follows this pattern:

# 1. Create a PipelineSession
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
session = PipelineSession()

# 2. Define steps
step_process = ProcessingStep(name="PrepareData", ...)
step_train = TrainingStep(name="TrainModel", ...)
step_eval = ProcessingStep(name="EvaluateModel", ...)
step_register = ModelStep(name="RegisterModel", ...)
step_cond = ConditionStep(name="CheckMetrics", if_steps=[step_register], ...)

# 3. Assemble pipeline (step_register runs inside the condition branch)
pipeline = Pipeline(
  name="AnyCompany-Attrition-Pipeline",
  parameters=[instance_type, train_data, auc_threshold],
  steps=[step_process, step_train, step_eval, step_cond]
)

# 4. Create (or update) and execute
pipeline.upsert(role_arn=role)
execution = pipeline.start()
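After pipeline.start(), the returned execution object can be polled from code as well as from Studio. execution.wait() and execution.list_steps() are real SDK calls; the summarizing helper and the sample payload below are illustrative assumptions:

```python
def summarize_steps(steps):
    """Condense list_steps() results into a {step name: status} dict."""
    return {s["StepName"]: s["StepStatus"] for s in steps}

# With a live execution from pipeline.start():
#   execution.wait()                          # block until a terminal state
#   summarize_steps(execution.list_steps())   # per-step statuses

# Illustrative shape of what list_steps() returns:
sample = [
    {"StepName": "PrepareData", "StepStatus": "Succeeded"},
    {"StepName": "TrainModel", "StepStatus": "Executing"},
]
print(summarize_steps(sample))
```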

🎯 The @step Decorator

The newest way to create pipeline steps — convert any Python function into a pipeline step with a single decorator:

import pandas as pd
from sagemaker.workflow.function_step import step

@step(name="preprocess", instance_type="ml.m5.xlarge")
def preprocess(input_path: str) -> str:
  # Your existing preprocessing code — unchanged
  df = pd.read_csv(input_path)
  df = df.dropna().drop_duplicates()
  output_path = "/opt/ml/processing/output/clean.csv"
  df.to_csv(output_path, index=False)
  return output_path
⚡
Lift-and-shift: The @step decorator lets you take existing notebook code and convert it to a pipeline step without rewriting. Test locally first, then add the decorator to orchestrate.

📊 Monitoring Executions in Studio

After starting a pipeline, Studio provides real-time visibility:

🔀

DAG View

Visual graph showing step status (running/succeeded/failed) with color coding. Click any node to see logs.

📈

Execution History

Compare multiple runs side-by-side. See which parameters changed and how metrics improved.

🔍

Lineage Tracking

Trace any model back to its training data, processing code, and hyperparameters. Full audit trail.

🔄

Retry Policy

Configure automatic retries for transient failures (capacity errors, throttling). Set MaxAttempts per step.

πŸ—οΈ Pipeline Builder β€” Design Your Own

Select an AnyCompany HCM scenario to see a complete pipeline configuration with step types, parameters, quality gates, and deployment strategy.

🔍

Payroll Fraud Detection

Monthly retrain on new payroll transactions. High recall required. Auto-deploy if quality gate passes.

🚪

Employee Attrition

Quarterly retrain on HR data. Bias check mandatory. Manual approval before production.

💬

AnyCompany Assist (LLM)

Continuous fine-tuning on support tickets. A/B test before full rollout. Human-in-the-loop evaluation.


🏒 Why Pipelines Matter for AnyCompany

AnyCompany processes payroll for millions of workers across multiple countries. ML models must be retrained regularly as workforce patterns change. Without orchestration, this is manual, error-prone, and unauditable.

🔄

Reproducibility

Every model can be traced back to exact data, code, and parameters. Critical for compliance audits across 140+ jurisdictions.

⏰

Automation

Monthly payroll data triggers retraining automatically. No manual notebook runs. No forgotten steps.

🛡️

Quality Gates

Condition steps enforce minimum metrics before deployment. A fraud model with recall < 95% never reaches production.

👥

Governance

Model Registry tracks versions with approval workflows. Compliance team reviews before production deployment.

🔄 Pipeline Triggers in Production

Pipelines can be triggered by multiple events in a production HCM system:

| Trigger | Mechanism | HCM Scenario |
| --- | --- | --- |
| Schedule | EventBridge cron rule | Monthly fraud model retrain after payroll cycle closes |
| Data Arrival | S3 event → Lambda → pipeline.start() | New country onboarded → retrain multi-country model |
| Drift Detected | Model Monitor → CloudWatch → Lambda | Feature drift in attrition model triggers retraining |
| Manual | Studio UI or API call | Data science team experiments with new features |
| CI/CD | CodePipeline / GitHub Actions | New preprocessing code merged → validate with pipeline run |

🔑 Key Takeaways

1️⃣

DAG = Dependency Graph

Define what depends on what. Pipelines figures out execution order and parallelism automatically.

2️⃣

Parameters = Reusability

One pipeline definition serves dev, staging, and production. Override parameters at runtime.

3️⃣

Condition = Quality Gate

Never deploy a bad model. Condition steps enforce metric thresholds before registration.

4️⃣

Registry = Governance

Every model version is cataloged with metrics, lineage, and approval status. Audit-ready.