Lab 5 — Interactive Explainer

Model Deployment & Traffic Shifting

Deploy a model to a real-time endpoint, configure blue/green linear traffic shifting, monitor with CloudWatch alarms, and observe automatic rollback on failure.

🚀 Blue/Green Deploy 📊 CloudWatch Alarms ↩️ Auto-Rollback 🏢 HCM Context 🧪 Lab 5

📋 Lab 5 Overview

You've trained and tuned your model in Labs 3–4. Now it's time to put it into production. This lab teaches you how to deploy a model to a SageMaker real-time endpoint and safely shift traffic from an old model to a new one using blue/green deployment with linear traffic shifting — the same pattern used by AnyCompany to update fraud detection models without downtime.

Duration: ~45 minutes • Phase: Deployment • Prerequisite: Labs 3–4 (trained model artifacts in S3)

What You Build

Blue Fleet

📦 Model A

Production model
XGBoost 1.5-1
100% traffic initially

Traffic Shift

⚖️ Linear Policy

Gradual shift
CloudWatch monitoring
Auto-rollback on alarm

Green Fleet

✅ Model E

New model
Better performance
100% traffic after success

The twist: You first try deploying Model B (which has errors). The CloudWatch alarm fires, traffic automatically rolls back to Model A. Then you deploy Model E (the good model) successfully.

Key Concepts Covered

🚀

Real-Time Endpoints

SageMaker hosts your model on dedicated instances with auto-scaling. Invoke via API for sub-second predictions (a minimal invocation sketch follows these concept cards). The endpoint stays live 24/7.

🟢

Blue/Green Deployment

Run old (blue) and new (green) models simultaneously. Gradually shift traffic from blue to green. If green fails, instantly revert to blue.

🔔

CloudWatch Alarms

Monitor 5XX errors and model latency during deployment. If metrics breach thresholds, the alarm triggers automatic rollback — no human intervention needed.

↩️

Automatic Rollback

When the alarm fires, SageMaker stops the traffic shift and routes 100% back to the original model. Zero downtime, zero data loss.
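The real-time endpoint concept above maps to a single runtime API call. A minimal invocation sketch; the endpoint name and the CSV payload are placeholder assumptions:

```python
import boto3

# SageMaker runtime client — used for invoking endpoints, not managing them
runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name and CSV feature row for an XGBoost model
response = runtime.invoke_endpoint(
    EndpointName="fraud-detection-endpoint",
    ContentType="text/csv",
    Body="0.12,4500.00,1,0,3",
)

# The prediction comes back in the response body
print(response["Body"].read().decode("utf-8"))
```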

🔄 Deployment Flow

The lab walks through a complete deployment lifecycle: create endpoint → test → set alarms → shift traffic → handle failure → retry with the good model. The stages are:

🚀 Create Endpoint: Model A (prod)
📊 Test & Monitor: Invoke + CloudWatch
🔔 Set Alarms: 5XX + Latency
⚖️ Shift Traffic: Linear policy
✅ Verify: Success or rollback
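The first stage (create endpoint) maps to three boto3 calls. A minimal sketch, assuming placeholder names for the model, image URI, S3 artifact path, instance type, and execution role:

```python
import boto3

sm = boto3.client("sagemaker")

# 1. Register the model artifact (names, image URI, and S3 path are placeholders)
sm.create_model(
    ModelName="model-a",
    PrimaryContainer={
        "Image": "<xgboost-inference-image-uri>",
        "ModelDataUrl": "s3://my-bucket/model-a/model.tar.gz",
    },
    ExecutionRoleArn="<sagemaker-execution-role-arn>",
)

# 2. Endpoint config: one production variant serving 100% of traffic (the "blue" fleet)
sm.create_endpoint_config(
    EndpointConfigName="config-model-a",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "model-a",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    }],
)

# 3. Create the real-time endpoint (takes a few minutes to reach InService)
sm.create_endpoint(
    EndpointName="fraud-detection-endpoint",
    EndpointConfigName="config-model-a",
)
```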

🟦 Blue/Green Deployment

Blue/green deployment maintains two identical production environments. The "blue" fleet runs the current model, while the "green" fleet hosts the new model. Traffic is gradually shifted from blue to green using a linear policy.

Linear Traffic Shifting Policy

Instead of switching 100% of traffic instantly (risky), linear shifting moves traffic in increments over a defined period. SageMaker monitors health at each step.

| Time | Blue (Model A) | Green (New Model) | What Happens |
|---|---|---|---|
| T+0 | 100% | 0% | Deployment starts, green fleet provisioned |
| T+1 min | 75% | 25% | First traffic batch shifted, alarms monitored |
| T+2 min | 50% | 50% | Equal split — critical monitoring window |
| T+3 min | 25% | 75% | Majority on green, final validation |
| T+4 min | 0% | 100% | Complete — blue fleet decommissioned |
💡 Why linear over canary? Canary sends a tiny percentage (e.g., 5%) to the new model first. Linear shifts in equal increments. For AnyCompany's payroll fraud model, linear is preferred because you need statistically significant traffic volume at each step to detect subtle accuracy regressions.
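One way to express the linear schedule above when updating the endpoint with boto3. The endpoint name, config name, and alarm name are placeholders; the 25% step with a 60-second wait mirrors the table:

```python
import boto3

sm = boto3.client("sagemaker")

# Shift traffic from the blue config to the green config in 25% steps,
# waiting 60 seconds between steps and rolling back if the alarm fires.
sm.update_endpoint(
    EndpointName="fraud-detection-endpoint",
    EndpointConfigName="config-model-e",          # green fleet config (placeholder)
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 25},
                "WaitIntervalInSeconds": 60,
            },
            "TerminationWaitInSeconds": 60,       # keep blue alive briefly after the shift completes
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "endpoint-5xx-alarm"}],
        },
    },
)
```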

Deployment Strategies Compared

| Strategy | How It Works | Rollback Speed | Risk Level | Best For |
|---|---|---|---|---|
| All-at-once | Instant 0→100% switch | Manual (minutes) | High | Dev/test environments |
| Canary | Small % first, then all | Automatic (seconds) | Medium | Low-traffic endpoints |
| Linear (this lab) | Equal increments over time | Automatic (seconds) | Low | Production ML models |
| Blue/Green | Full parallel fleet, DNS switch | Instant (DNS) | Lowest | Mission-critical systems |

The Two Deployments in This Lab

Attempt 1: Model B (Broken)

Intentionally deploys a model that throws 5XX errors. CloudWatch alarm fires → automatic rollback to Model A. Demonstrates the safety net works.

Attempt 2: Model E (Good)

Deploys the properly trained model. No errors during traffic shift → linear policy completes → Model E takes 100% traffic. Uses RetainDeploymentConfig=True to reuse alarm settings.

📊 Monitoring During Deployment

CloudWatch metrics are the eyes and ears of your deployment. SageMaker emits endpoint metrics automatically — you just need to set alarm thresholds that trigger rollback when something goes wrong.

Key Metrics Monitored

| Metric | Namespace | What It Measures | Alarm Threshold (Lab 5) |
|---|---|---|---|
| Invocation5XXErrors | AWS/SageMaker | Server-side errors (model crashes, OOM, bad predictions) | > 1% error rate for 1 minute |
| ModelLatency | AWS/SageMaker | Time for the model to process a request (CloudWatch reports this in microseconds) | > 5000 ms average for 1 minute |
| Invocation4XXErrors | AWS/SageMaker | Client-side errors (bad input format) | Monitored but no alarm |
| OverheadLatency | AWS/SageMaker | SageMaker infrastructure overhead (not model time) | Monitored but no alarm |
| CPUUtilization | /aws/sagemaker/Endpoints | Instance CPU usage during inference | Monitored for capacity planning |
| Invocations | AWS/SageMaker | Total number of requests processed | Monitored for traffic volume |
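A sketch of the 5XX alarm from the table, created with boto3. The alarm, endpoint, and variant names are assumptions; the lab's percentage threshold is approximated here with a simple error-count threshold (a production setup would use metric math over Invocation5XXErrors / Invocations):

```python
import boto3

cw = boto3.client("cloudwatch")

# Alarm on server-side errors for the endpoint variant over a 1-minute window.
cw.put_metric_alarm(
    AlarmName="endpoint-5xx-alarm",                    # placeholder name
    Namespace="AWS/SageMaker",
    MetricName="Invocation5XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "fraud-detection-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=60,                 # 1-minute evaluation window
    EvaluationPeriods=1,
    Threshold=1,               # count-based stand-in for the 1% rule
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```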

How Alarms Work

✅ OK State

Metric within threshold
Traffic shift continues

⚠️ INSUFFICIENT_DATA

Not enough data points yet
Shift pauses, waits for data

🚨 ALARM State

Threshold breached
Immediate rollback triggered

⚠️ Alarm configuration matters: Setting thresholds too tight causes false rollbacks (model is fine but alarm fires on a traffic spike). Too loose means real problems slip through. In production, tune thresholds based on baseline metrics from your current model.
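To see which of the three states an alarm is in while the shift runs, you can poll it. A minimal sketch, assuming the alarm name used earlier:

```python
import boto3

cw = boto3.client("cloudwatch")

# StateValue is one of OK, INSUFFICIENT_DATA, or ALARM
resp = cw.describe_alarms(AlarmNames=["endpoint-5xx-alarm"])
for alarm in resp["MetricAlarms"]:
    print(alarm["AlarmName"], "->", alarm["StateValue"])
```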

What You Observe in the Lab

| Phase | Invocations | 5XX Errors | Latency | Outcome |
|---|---|---|---|---|
| Initial (Model A) | ~2000 requests | 0 | ~50 ms avg | Baseline established |
| Shift to Model B | Traffic splitting | Errors appear ("E" in output) | Spikes | Alarm fires → rollback |
| After rollback | 100% back to A | 0 | ~50 ms | Service restored |
| Shift to Model E | Traffic splitting | 0 | Decreasing | Successful deployment |
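The observations above come from sending a steady stream of requests and printing a marker per call. A rough sketch of that loop, with the endpoint name and payload as placeholders:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Send a steady stream of requests; print "." for success, "E" for a failed call.
# During the shift to the broken model, "E"s start appearing as green takes traffic.
for _ in range(2000):
    try:
        runtime.invoke_endpoint(
            EndpointName="fraud-detection-endpoint",
            ContentType="text/csv",
            Body="0.12,4500.00,1,0,3",
        )
        print(".", end="", flush=True)
    except Exception:
        print("E", end="", flush=True)
```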

↩️ Automatic Rollback

The safety net that makes blue/green deployment production-safe. When CloudWatch alarms fire during a traffic shift, SageMaker automatically reverts all traffic to the original model — no human intervention, no downtime.

Rollback Sequence

| Step | What Happens | Duration |
|---|---|---|
| 1. Alarm fires | CloudWatch detects 5XX errors exceed 1% threshold for 1 minute | ~60 seconds |
| 2. Traffic reverts | SageMaker routes 100% traffic back to blue fleet (Model A) | ~10 seconds |
| 3. Green fleet removed | Failed model instances are terminated, endpoint config cleaned up | ~1–2 minutes |
| 4. Status: InService | Endpoint returns to stable state with original model serving all traffic | Immediate |
💡 Total downtime: near zero. During the rollback, requests that hit the green fleet may fail during those ~10 seconds, but the blue fleet continues serving. Clients with retry logic experience no visible outage.
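You can watch the rollback from the client side by polling the endpoint status. A small sketch, assuming the same endpoint name:

```python
import time

import boto3

sm = boto3.client("sagemaker")

# EndpointStatus moves through Updating -> RollingBack -> InService on a failed shift
while True:
    status = sm.describe_endpoint(EndpointName="fraud-detection-endpoint")["EndpointStatus"]
    print(status)
    if status in ("InService", "Failed"):
        break
    time.sleep(30)
```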

RetainDeploymentConfig

After a rollback, you fix the model and try again. The RetainDeploymentConfig=True parameter tells SageMaker to reuse the same traffic routing policy and alarm configuration from the failed attempt — no need to reconfigure everything.

🔄

Without RetainDeploymentConfig

Must re-specify the entire DeploymentConfig block: BlueGreenUpdatePolicy, traffic routing, alarm references, wait intervals. Error-prone if done manually.

With RetainDeploymentConfig=True

Just provide the new EndpointConfigName. SageMaker reuses the linear policy, alarm ARNs, and wait intervals from the previous deployment. One line change.
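With the deployment config retained, the retry after a rollback reduces to a single call. The Model E config name is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")

# Reuse the linear policy, alarm ARNs, and wait intervals from the failed deployment;
# only the endpoint config (now pointing at Model E) changes.
sm.update_endpoint(
    EndpointName="fraud-detection-endpoint",
    EndpointConfigName="config-model-e",
    RetainDeploymentConfig=True,
)
```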

Common Rollback Triggers

💥

Model Crashes (5XX)

New model throws exceptions — incompatible input format, missing dependencies, OOM on larger inputs. Most common in this lab.

⏱️

High Latency

New model is too slow — larger architecture, unoptimized inference code, or insufficient instance size. Breaches latency SLA.

📉

Accuracy Degradation

Not directly monitored by CloudWatch alarms in this lab, but in production you'd add custom metrics comparing prediction distributions.
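As a sketch of the custom-metric idea mentioned above, you could publish a summary statistic of the new model's predictions and alarm on it separately. The namespace, metric name, and value below are illustrative assumptions:

```python
import boto3

cw = boto3.client("cloudwatch")

# Hypothetical custom metric: fraction of recent requests the model flags as fraud.
# A sudden jump or drop relative to the blue model's baseline suggests degradation.
cw.put_metric_data(
    Namespace="AnyCompany/FraudModel",
    MetricData=[{
        "MetricName": "PositivePredictionRate",
        "Value": 0.012,            # computed from a recent window of predictions
        "Unit": "None",
        "Dimensions": [{"Name": "EndpointName", "Value": "fraud-detection-endpoint"}],
    }],
)
```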

🏢 HCM Mapping — AnyCompany Context

How does blue/green deployment with traffic shifting apply to AnyCompany's ML products? Each product has different risk tolerance, traffic patterns, and rollback requirements.

Deployment Scenarios at AnyCompany

🏢 The scenarios below show how deployment strategies map to different AnyCompany ML products.
🚨

Payroll Fraud Detection

Real-time endpoint, zero tolerance for downtime. Missed fraud = $50K+ loss per incident.

💬

AnyCompany Assist (Chatbot)

High-traffic LLM endpoint. Latency-sensitive — users expect sub-2s responses.

📉

Attrition Prediction (Batch)

Monthly batch inference. Lower risk — can validate offline before switching.

📄

Document OCR Pipeline

Async processing of tax forms. Throughput matters more than latency.

Lab 5 → AnyCompany Fraud Detection

| Lab 5 Concept | AnyCompany Equivalent | Why It Matters |
|---|---|---|
| Model A (production) | Current fraud model (v2.3) | Serving millions of payroll transactions daily |
| Model B (broken) | Model trained on corrupted data | Would flag legitimate transactions as fraud — business impact |
| Model E (improved) | Retrained model with new fraud patterns | Catches new fraud tactics from recent months |
| Linear traffic shift | Gradual rollout across client segments | Start with low-risk clients, expand to enterprise |
| 5XX alarm | Prediction failure rate alarm | Model crashes = transactions processed without fraud check |
| Latency alarm | SLA breach alarm (<200ms required) | Payroll processing has strict time windows |
| Auto-rollback | Instant revert to proven model | Compliance requirement — cannot have unprotected window |

Production Deployment Patterns

💡 Multi-region deployment: AnyCompany processes payroll across 40+ countries. A model update rolls out region-by-region: APAC first (lower volume), then EMEA, then Americas. Each region uses its own blue/green deployment with independent alarms. A failure in APAC doesn't affect Americas.

💡 Deployment windows: Payroll fraud detection deploys during off-peak hours (weekends, after payroll cycles close). AnyCompany Assist deploys during low-traffic windows (2–4 AM local time). Attrition models deploy anytime — batch inference isn't latency-sensitive.