Lab 5 — Interactive Explainer

Model Deployment & Traffic Shifting

Deploy a model to a real-time endpoint, configure blue/green linear traffic shifting, monitor with CloudWatch alarms, and observe automatic rollback on failure.

🚀 Blue/Green Deploy 📊 CloudWatch Alarms ↩️ Auto-Rollback 🏢 HCM Context 🧪 Lab 5

📋 Lab 5 Overview

You've trained and tuned your model in Labs 3–4. Now it's time to put it into production. This lab teaches you how to deploy a model to a SageMaker real-time endpoint and safely shift traffic from an old model to a new one using blue/green deployment with linear traffic shifting — the same pattern used by AnyCompany to update fraud detection models without downtime.

Duration: ~45 minutes • Phase: Deployment • Prerequisite: Labs 3–4 (trained model artifacts in S3)

What You Build

Blue Fleet

📦 Model A

Production model
XGBoost 1.5-1
100% traffic initially

Traffic Shift

⚖️ Linear Policy

Gradual shift
CloudWatch monitoring
Auto-rollback on alarm

Green Fleet

✅ Model E

New model
Better performance
100% traffic after success

The twist: You first try deploying Model B (which has errors). The CloudWatch alarm fires, traffic automatically rolls back to Model A. Then you deploy Model E (the good model) successfully.

Key Concepts Covered

🚀

Real-Time Endpoints

SageMaker hosts your model on dedicated instances with auto-scaling. Invoke via API for sub-second predictions (a minimal invocation sketch follows these concept cards). The endpoint stays live 24/7.

🟢

Blue/Green Deployment

Run old (blue) and new (green) models simultaneously. Gradually shift traffic from blue to green. If green fails, instantly revert to blue.

🔔

CloudWatch Alarms

Monitor 5XX errors and model latency during deployment. If metrics breach thresholds, the alarm triggers automatic rollback — no human intervention needed.

↩️

Automatic Rollback

When the alarm fires, SageMaker stops the traffic shift and routes 100% back to the original model. Zero downtime, zero data loss.
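The real-time endpoint concept above maps to a single runtime API call. A minimal invocation sketch; the endpoint name and the CSV payload are placeholder assumptions:

```python
import boto3

# SageMaker runtime client — used for invoking endpoints, not managing them
runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name and CSV feature row for an XGBoost model
response = runtime.invoke_endpoint(
    EndpointName="fraud-detection-endpoint",
    ContentType="text/csv",
    Body="0.12,4500.00,1,0,3",
)

# The prediction comes back in the response body
print(response["Body"].read().decode("utf-8"))
```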

🔄 Deployment Flow

The lab walks through a complete deployment lifecycle: create endpoint → test → set alarms → shift traffic → handle failure → retry with the good model. The stages are:

🚀 Create Endpoint: Model A (prod)
📊 Test & Monitor: Invoke + CloudWatch
🔔 Set Alarms: 5XX + Latency
⚖️ Shift Traffic: Linear policy
✅ Verify: Success or rollback
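The first stage (create endpoint) maps to three boto3 calls. A minimal sketch, assuming placeholder names for the model, image URI, S3 artifact path, instance type, and execution role:

```python
import boto3

sm = boto3.client("sagemaker")

# 1. Register the model artifact (names, image URI, and S3 path are placeholders)
sm.create_model(
    ModelName="model-a",
    PrimaryContainer={
        "Image": "<xgboost-inference-image-uri>",
        "ModelDataUrl": "s3://my-bucket/model-a/model.tar.gz",
    },
    ExecutionRoleArn="<sagemaker-execution-role-arn>",
)

# 2. Endpoint config: one production variant serving 100% of traffic (the "blue" fleet)
sm.create_endpoint_config(
    EndpointConfigName="config-model-a",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "model-a",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    }],
)

# 3. Create the real-time endpoint (takes a few minutes to reach InService)
sm.create_endpoint(
    EndpointName="fraud-detection-endpoint",
    EndpointConfigName="config-model-a",
)
```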

🟦 Blue/Green Deployment

Blue/green deployment maintains two identical production environments. The "blue" fleet runs the current model, while the "green" fleet hosts the new model. Traffic is gradually shifted from blue to green using a linear policy.

Linear Traffic Shifting Policy

Instead of switching 100% of traffic instantly (risky), linear shifting moves traffic in increments over a defined period. SageMaker monitors health at each step.

| Time | Blue (Model A) | Green (New Model) | What Happens |
|---|---|---|---|
| T+0 | 100% | 0% | Deployment starts, green fleet provisioned |
| T+1 min | 75% | 25% | First traffic batch shifted, alarms monitored |
| T+2 min | 50% | 50% | Equal split — critical monitoring window |
| T+3 min | 25% | 75% | Majority on green, final validation |
| T+4 min | 0% | 100% | Complete — blue fleet decommissioned |
💡 Why linear over canary? Canary sends a tiny percentage (e.g., 5%) to the new model first. Linear shifts in equal increments. For AnyCompany's payroll fraud model, linear is preferred because you need statistically significant traffic volume at each step to detect subtle accuracy regressions.
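One way to express the linear schedule above when updating the endpoint with boto3. The endpoint name, config name, and alarm name are placeholders; the 25% step with a 60-second wait mirrors the table:

```python
import boto3

sm = boto3.client("sagemaker")

# Shift traffic from the blue config to the green config in 25% steps,
# waiting 60 seconds between steps and rolling back if the alarm fires.
sm.update_endpoint(
    EndpointName="fraud-detection-endpoint",
    EndpointConfigName="config-model-e",          # green fleet config (placeholder)
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 25},
                "WaitIntervalInSeconds": 60,
            },
            "TerminationWaitInSeconds": 60,       # keep blue alive briefly after the shift completes
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "endpoint-5xx-alarm"}],
        },
    },
)
```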

Deployment Strategies Compared

| Strategy | How It Works | Rollback Speed | Risk Level | Best For |
|---|---|---|---|---|
| All-at-once | Instant 0→100% switch | Manual (minutes) | High | Dev/test environments |
| Canary | Small % first, then all | Automatic (seconds) | Medium | Low-traffic endpoints |
| Linear (this lab) | Equal increments over time | Automatic (seconds) | Low | Production ML models |
| Blue/Green | Full parallel fleet, DNS switch | Instant (DNS) | Lowest | Mission-critical systems |

The Two Deployments in This Lab

Attempt 1: Model B (Broken)

Intentionally deploys a model that throws 5XX errors. CloudWatch alarm fires → automatic rollback to Model A. Demonstrates the safety net works.

Attempt 2: Model E (Good)

Deploys the properly trained model. No errors during traffic shift → linear policy completes → Model E takes 100% traffic. Uses RetainDeploymentConfig=True to reuse alarm settings.

📊 Monitoring During Deployment

CloudWatch metrics are the eyes and ears of your deployment. SageMaker emits endpoint metrics automatically — you just need to set alarm thresholds that trigger rollback when something goes wrong.

Key Metrics Monitored

| Metric | Namespace | What It Measures | Alarm Threshold (Lab 5) |
|---|---|---|---|
| Invocation5XXErrors | AWS/SageMaker | Server-side errors (model crashes, OOM, bad predictions) | > 1% error rate for 1 minute |
| ModelLatency | AWS/SageMaker | Time for the model to process a request (CloudWatch reports this in microseconds) | > 5000 ms average for 1 minute |
| Invocation4XXErrors | AWS/SageMaker | Client-side errors (bad input format) | Monitored but no alarm |
| OverheadLatency | AWS/SageMaker | SageMaker infrastructure overhead (not model time) | Monitored but no alarm |
| CPUUtilization | /aws/sagemaker/Endpoints | Instance CPU usage during inference | Monitored for capacity planning |
| Invocations | AWS/SageMaker | Total number of requests processed | Monitored for traffic volume |
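A sketch of the 5XX alarm from the table, created with boto3. The alarm, endpoint, and variant names are assumptions; the lab's percentage threshold is approximated here with a simple error-count threshold (a production setup would use metric math over Invocation5XXErrors / Invocations):

```python
import boto3

cw = boto3.client("cloudwatch")

# Alarm on server-side errors for the endpoint variant over a 1-minute window.
cw.put_metric_alarm(
    AlarmName="endpoint-5xx-alarm",                    # placeholder name
    Namespace="AWS/SageMaker",
    MetricName="Invocation5XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "fraud-detection-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=60,                 # 1-minute evaluation window
    EvaluationPeriods=1,
    Threshold=1,               # count-based stand-in for the 1% rule
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```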

How Alarms Work

✅ OK State

Metric within threshold
Traffic shift continues

⚠️ INSUFFICIENT_DATA

Not enough data points yet
Shift pauses, waits for data

🚨 ALARM State

Threshold breached
Immediate rollback triggered

⚠️ Alarm configuration matters: Setting thresholds too tight causes false rollbacks (model is fine but alarm fires on a traffic spike). Too loose means real problems slip through. In production, tune thresholds based on baseline metrics from your current model.
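To see which of the three states an alarm is in while the shift runs, you can poll it. A minimal sketch, assuming the alarm name used earlier:

```python
import boto3

cw = boto3.client("cloudwatch")

# StateValue is one of OK, INSUFFICIENT_DATA, or ALARM
resp = cw.describe_alarms(AlarmNames=["endpoint-5xx-alarm"])
for alarm in resp["MetricAlarms"]:
    print(alarm["AlarmName"], "->", alarm["StateValue"])
```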

What You Observe in the Lab

| Phase | Invocations | 5XX Errors | Latency | Outcome |
|---|---|---|---|---|
| Initial (Model A) | ~2000 requests | 0 | ~50 ms avg | Baseline established |
| Shift to Model B | Traffic splitting | Errors appear ("E" in output) | Spikes | Alarm fires → rollback |
| After rollback | 100% back to A | 0 | ~50 ms | Service restored |
| Shift to Model E | Traffic splitting | 0 | Decreasing | Successful deployment |
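The observations above come from sending a steady stream of requests and printing a marker per call. A rough sketch of that loop, with the endpoint name and payload as placeholders:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Send a steady stream of requests; print "." for success, "E" for a failed call.
# During the shift to the broken model, "E"s start appearing as green takes traffic.
for _ in range(2000):
    try:
        runtime.invoke_endpoint(
            EndpointName="fraud-detection-endpoint",
            ContentType="text/csv",
            Body="0.12,4500.00,1,0,3",
        )
        print(".", end="", flush=True)
    except Exception:
        print("E", end="", flush=True)
```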

↩️ Automatic Rollback

The safety net that makes blue/green deployment production-safe. When CloudWatch alarms fire during a traffic shift, SageMaker automatically reverts all traffic to the original model — no human intervention, no downtime.

Rollback Sequence

| Step | What Happens | Duration |
|---|---|---|
| 1. Alarm fires | CloudWatch detects 5XX errors exceed 1% threshold for 1 minute | ~60 seconds |
| 2. Traffic reverts | SageMaker routes 100% traffic back to blue fleet (Model A) | ~10 seconds |
| 3. Green fleet removed | Failed model instances are terminated, endpoint config cleaned up | ~1–2 minutes |
| 4. Status: InService | Endpoint returns to stable state with original model serving all traffic | Immediate |
💡 Total downtime: near zero. During the rollback, requests that hit the green fleet may fail during those ~10 seconds, but the blue fleet continues serving. Clients with retry logic experience no visible outage.
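You can watch the rollback from the client side by polling the endpoint status. A small sketch, assuming the same endpoint name:

```python
import time

import boto3

sm = boto3.client("sagemaker")

# EndpointStatus moves through Updating -> RollingBack -> InService on a failed shift
while True:
    status = sm.describe_endpoint(EndpointName="fraud-detection-endpoint")["EndpointStatus"]
    print(status)
    if status in ("InService", "Failed"):
        break
    time.sleep(30)
```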

RetainDeploymentConfig

After a rollback, you fix the model and try again. The RetainDeploymentConfig=True parameter tells SageMaker to reuse the same traffic routing policy and alarm configuration from the failed attempt — no need to reconfigure everything.

🔄

Without RetainDeploymentConfig

Must re-specify the entire DeploymentConfig block: BlueGreenUpdatePolicy, traffic routing, alarm references, wait intervals. Error-prone if done manually.

With RetainDeploymentConfig=True

Just provide the new EndpointConfigName. SageMaker reuses the linear policy, alarm ARNs, and wait intervals from the previous deployment. One line change.
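With the deployment config retained, the retry after a rollback reduces to a single call. The Model E config name is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")

# Reuse the linear policy, alarm ARNs, and wait intervals from the failed deployment;
# only the endpoint config (now pointing at Model E) changes.
sm.update_endpoint(
    EndpointName="fraud-detection-endpoint",
    EndpointConfigName="config-model-e",
    RetainDeploymentConfig=True,
)
```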

Common Rollback Triggers

💥

Model Crashes (5XX)

New model throws exceptions — incompatible input format, missing dependencies, OOM on larger inputs. Most common in this lab.

⏱️

High Latency

New model is too slow — larger architecture, unoptimized inference code, or insufficient instance size. Breaches latency SLA.

📉

Accuracy Degradation

Not directly monitored by CloudWatch alarms in this lab, but in production you'd add custom metrics comparing prediction distributions.
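As a sketch of the custom-metric idea mentioned above, you could publish a summary statistic of the new model's predictions and alarm on it separately. The namespace, metric name, and value below are illustrative assumptions:

```python
import boto3

cw = boto3.client("cloudwatch")

# Hypothetical custom metric: fraction of recent requests the model flags as fraud.
# A sudden jump or drop relative to the blue model's baseline suggests degradation.
cw.put_metric_data(
    Namespace="AnyCompany/FraudModel",
    MetricData=[{
        "MetricName": "PositivePredictionRate",
        "Value": 0.012,            # computed from a recent window of predictions
        "Unit": "None",
        "Dimensions": [{"Name": "EndpointName", "Value": "fraud-detection-endpoint"}],
    }],
)
```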

🏢 HCM Mapping — AnyCompany Context

How does blue/green deployment with traffic shifting apply to AnyCompany's ML products? Each product has different risk tolerance, traffic patterns, and rollback requirements.

Deployment Scenarios at AnyCompany

🏢 The scenarios below show how deployment strategies map to different AnyCompany ML products.
🚨

Payroll Fraud Detection

Real-time endpoint, zero tolerance for downtime. Missed fraud = $50K+ loss per incident.

💬

AnyCompany Assist (Chatbot)

High-traffic LLM endpoint. Latency-sensitive — users expect sub-2s responses.

📉

Attrition Prediction (Batch)

Monthly batch inference. Lower risk — can validate offline before switching.

📄

Document OCR Pipeline

Async processing of tax forms. Throughput matters more than latency.

Lab 5 → AnyCompany Fraud Detection

| Lab 5 Concept | AnyCompany Equivalent | Why It Matters |
|---|---|---|
| Model A (production) | Current fraud model (v2.3) | Serving millions of payroll transactions daily |
| Model B (broken) | Model trained on corrupted data | Would flag legitimate transactions as fraud — business impact |
| Model E (improved) | Retrained model with new fraud patterns | Catches new fraud tactics from recent months |
| Linear traffic shift | Gradual rollout across client segments | Start with low-risk clients, expand to enterprise |
| 5XX alarm | Prediction failure rate alarm | Model crashes = transactions processed without fraud check |
| Latency alarm | SLA breach alarm (<200ms required) | Payroll processing has strict time windows |
| Auto-rollback | Instant revert to proven model | Compliance requirement — cannot have unprotected window |

Production Deployment Patterns

💡 Multi-region deployment: AnyCompany processes payroll across 40+ countries. A model update rolls out region-by-region: APAC first (lower volume), then EMEA, then Americas. Each region uses its own blue/green deployment with independent alarms. A failure in APAC doesn't affect Americas.

💡 Deployment windows: Payroll fraud detection deploys during off-peak hours (weekends, after payroll cycles close). AnyCompany Assist deploys during low-traffic windows (2–4 AM local time). Attrition models deploy anytime — batch inference isn't latency-sensitive.