🧪 Hands-On Lab Guide

Your roadmap through 7 labs covering the full ML lifecycle on AWS — from raw data to production monitoring

🧪 7 Labs · ⏱ ~5.5 hours total · 🛠 Amazon SageMaker · 📊 Full ML Lifecycle
🔄
Lab Architecture Pipeline

Labs follow the ML lifecycle from data preparation through production monitoring.

  • Lab 1: Data Wrangling (1h 15m) · Data Prep
  • Lab 2: Processing (15m) · Data Prep
  • Lab 3: Training (25m) · Training
  • Lab 4: Tuning (30m) · Training
  • Lab 5: Deployment (45m) · Deployment
  • Lab 6: Pipelines (1h 30m) · MLOps
  • Lab 7: Monitoring (45m) · MLOps
📋
Lab Details & Task Flow
Lab 1: Analyze and Prepare Data with SageMaker Data Wrangler and Amazon EMR
⏱ 1 hour 15 minutes • Data Preparation & Feature Engineering
Scenario: A citizen advocacy group needs to predict whether individuals earn less than $50K/year to target government assistance promotions. You prepare and transform demographic data using SageMaker Data Wrangler and Apache Spark on Amazon EMR.

🎯 Objectives

  • Choose effective methods for visualizing data
  • Process missing values, outliers, and duplicated data
  • Define key encoding techniques (ordinal, one-hot)
  • Ingest and transform data in SageMaker Data Wrangler
  • Transform data using Spark on Amazon EMR

📝 Task Flow

1. Import & Preliminary Analysis: Import adult_data.csv into Data Wrangler, generate Data Quality & Insights Report
2. Analyze & Visualize Data: Generate histograms, check for target leakage using ROC (Receiver Operating Characteristic) analysis. Data Wrangler computes a ROC score per feature: 0.5 means the feature has zero predictive power (a coin flip), while a score close to 1.0 suggests the feature is too predictive and may be leaking the target.
3. Data Transformations & Export: Drop low-predictive columns, handle missing values, remove outliers, encode categoricals, split 70/20/10, export to S3
4. Set Up SageMaker Studio: Launch JupyterLab workspace, clone Git repository
5. Connect to EMR Cluster: Discover and connect to EMR cluster using SparkMagic PySpark kernel
6. Explore & Query with Spark: Run exploratory data analysis using Apache Spark on EMR
💡 Deep Dive: Why Target Leakage Analysis? (Task 2)
What is Target Leakage?

Target leakage occurs when a feature in your training data is strongly correlated with the target label but would not be available at prediction time in the real world. It is essentially “cheating” — the model gets information that leaks the answer, scores perfectly in training, but fails in production.

How Data Wrangler Detects It

SageMaker Data Wrangler computes the ROC score for each feature column individually via cross-validation. The ROC score tells you how well that single feature alone can predict the target:

| ROC Score | Meaning | Action |
| --- | --- | --- |
| 0.5 | Zero predictive power (coin flip) | Drop — adds noise without helping |
| 0.6 – 0.8 | Normal predictive power | Keep — good features |
| 0.9 – 1.0 | Suspiciously high | Investigate — possible leakage! |
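Data Wrangler computes this score internally, but the idea is easy to reproduce. The sketch below, using made-up toy data, estimates the per-feature ROC AUC as a rank statistic: the probability that a random positive example has a higher feature value than a random negative one.

```python
def single_feature_auc(feature, labels):
    """ROC AUC of a single feature used as a raw score.

    Returns the probability that a randomly chosen positive example
    has a higher feature value than a randomly chosen negative one.
    0.5 means coin-flip; values near 1.0 (or 0.0) suggest leakage.
    """
    pos = [x for x, y in zip(feature, labels) if y == 1]
    neg = [x for x, y in zip(feature, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: one leaky feature, one pure-noise feature.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
leaky  = [9, 8, 7, 6, 2, 1, 3, 0]   # perfectly separates the classes
noise  = [5, 1, 4, 2, 5, 1, 4, 2]   # identical distribution in both classes

print(single_feature_auc(leaky, labels))  # 1.0 -> investigate: leakage
print(single_feature_auc(noise, labels))  # 0.5 -> useless: drop
```

In practice the report runs this per column via cross-validation; the threshold logic in the table above then decides drop, keep, or investigate.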
What Lab 1 Finds in the Adult Dataset

The adult_data.csv has 14 features predicting income (≤$50K or >$50K). The target leakage report reveals:

| Feature | ROC | Verdict |
| --- | --- | --- |
| education_num | ~0.7 | ✅ Good predictor — keep |
| marital_status | ~0.7 | ✅ Good predictor — keep |
| hours_per_week | ~0.65 | ✅ Moderate — keep |
| fnlwgt | ~0.5 | ❌ Useless — drop |
| native_country | ~0.5 | ❌ Useless — drop |

Result: No target leakage detected (no feature near 1.0). But fnlwgt (census sampling weight) and native_country are dropped because ROC = 0.5 means they contribute nothing — just noise that slows training.

Why This Matters

Prevents false confidence: A leaked feature gives 99% accuracy in training but 50% in production. Target leakage analysis catches this before you waste weeks on a model that will fail when deployed.

Removes noise: Features with ROC = 0.5 add computational cost without improving predictions. Dropping them makes training faster and can improve accuracy.

HCM Analogy

Imagine building an attrition prediction model (will this employee leave?):

| Feature | ROC | Problem |
| --- | --- | --- |
| exit_interview_date | ~1.0 | 🚨 Leakage! If populated, they already left |
| badge_color | ~0.5 | ❌ Useless — no correlation |
| months_since_promotion | ~0.7 | ✅ Genuinely predictive — keep |
🔧 Deep Dive: Data Transformations & Export (Task 3)
Why Transform Before Training?

ML algorithms expect clean, numeric, consistently formatted data. Raw datasets from production systems contain messy strings, missing values, outliers, and mixed formats. Task 3 applies a transformation pipeline that converts raw census data into model-ready features.

🗺 Transformation Pipeline (11 Steps)

The complete sequence of transforms applied in Data Wrangler, in order:

1. Drop Columns: fnlwgt, native_country
2. Drop Missing: occupation, workclass = ?
3. Drop Duplicates: remove identical rows
4. Format String: strip left & right spaces
5. Encode Target: <=50K→1, >50K→0
6. Handle Outliers: capital_gain > 80K removed
7. Ordinal Encode: education, occupation
8. One-Hot Encode: sex, race, workclass, ...
9. Move Target to Col 0: XGBoost requirement
10. Split 70/20/10: train / test / validation
11. Export to S3: 3 CSVs → train/test/val
🔴 Step 5: Target Encoding — Why ≤50K = 1 and >50K = 0?

The income column contains string labels: "<=50K" and ">50K". XGBoost (and most ML algorithms) require numeric targets for binary classification. Data Wrangler’s Search and Edit → Find and Replace converts them:

| Original Value | Replaced With | Meaning |
| --- | --- | --- |
| <=50K | 1 | Positive class (eligible for assistance) |
| >50K | 0 | Negative class (not eligible) |

Why this encoding? The advocacy group wants to find people who earn ≤$50K. Making that the positive class (1) means the model’s recall metric directly measures “what % of eligible people did we correctly identify?” — the metric the business cares about most.

XGBoost requirement: The target column must also be the first column in the CSV (Step 9). XGBoost’s built-in SageMaker container reads column 0 as the label by default — no header row, no column name mapping. If you forget this step, the model trains on the wrong column silently.
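Steps 5 and 9 together can be sketched in plain Python. The rows below are made up; the labs do this through Data Wrangler's UI rather than code:

```python
# Step 5 (encode target) and Step 9 (move label to column 0), sketched on
# hypothetical rows shaped like the adult dataset after whitespace stripping.
raw = [
    ["39", "Bachelors", "<=50K"],   # age, education, income
    ["52", "HS-grad",   ">50K"],
]

def to_model_row(age, education, income):
    label = 1 if income == "<=50K" else 0   # positive class = eligible for assistance
    return [label, age, education]          # label first: XGBoost reads column 0

model_rows = [to_model_row(*r) for r in raw]
print(model_rows)  # [[1, '39', 'Bachelors'], [0, '52', 'HS-grad']]
```

Writing these rows to CSV with no header row yields exactly the layout the built-in XGBoost container expects.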

🟢 Step 4: Format String — Strip Left & Right

The adult dataset has categorical columns with inconsistent whitespace — values like " Private", "Private ", and " Private " would be treated as three different categories without cleaning.

| Operation | What It Does | Example |
| --- | --- | --- |
| Strip left/right | Remove leading & trailing whitespace | " Private " → "Private" |

Applied to: education, income, marital_status, occupation, race, relationship, sex, workclass

Why it matters: Without stripping, one-hot encoding creates duplicate columns for " Private" vs "Private" — inflating dimensionality and confusing the model. Also, the target encoding (Step 5) would fail: " <=50K" wouldn’t match the pattern "<=50K", leaving string values in the target column and crashing training.
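A toy sketch of the category-splitting problem, with made-up values:

```python
# Without stripping, " Private", "Private " and " Private " all look like
# different categories to an encoder, even though they are one value.
values = [" Private", "Private ", " Private ", "Private", "Federal-gov"]

print(len(set(values)))                   # 5 distinct raw strings
stripped = [v.strip() for v in values]    # Data Wrangler: Format string -> strip left & right
print(len(set(stripped)))                 # 2 real categories
```

The same failure would hit the target column: " <=50K" never matches the find-and-replace pattern "<=50K".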

🔵 Steps 1–3: Cleaning (Drop, Missing, Duplicates)
| Step | Action | Rationale |
| --- | --- | --- |
| 1. Drop columns | Remove fnlwgt, native_country | ROC = 0.5 (zero predictive power, just noise) |
| 2. Drop missing | Remove rows where occupation or workclass = "?" | ~6% of rows; random pattern, safe to drop |
| 3. Drop duplicates | Remove identical rows | Prevents model from memorizing repeated examples |
🟠 Step 6: Handle Outliers

The capital_gain column has extreme values (e.g., 99,999) that represent rare stock sales. These outliers skew gradient calculations during training. Data Wrangler removes values > 80,000 or < 0 using Min-Max numeric outliers with the Remove fix method.
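The Min-Max removal rule reduces to a simple bounds filter. Sketched below on fabricated rows (the real transform is configured in Data Wrangler's UI):

```python
# Min-Max outlier handling with the "Remove" fix, as applied to capital_gain:
# keep only rows whose value falls inside [0, 80_000]. Data is made up.
rows = [{"capital_gain": 0}, {"capital_gain": 5_000}, {"capital_gain": 99_999}]

kept = [r for r in rows if 0 <= r["capital_gain"] <= 80_000]
print([r["capital_gain"] for r in kept])  # [0, 5000]
```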

🟣 Steps 7–8: Encoding Categoricals
| Encoding | When to Use | Lab 1 Columns |
| --- | --- | --- |
| Ordinal | Natural order exists | education (HS < Bachelors < Masters < PhD), education_num, occupation |
| One-Hot | No inherent order | marital_status, race, relationship, sex, workclass |

Trap: One-hot encoding native_country (41 unique values) would create 41 sparse columns. Since ROC = 0.5 (useless), we dropped it in Step 1 before encoding — saving 41 dimensions and speeding up training.

🟢 Steps 9–11: Finalize & Export
| Step | Action | Why |
| --- | --- | --- |
| 9. Move target | Move income to column 0 | XGBoost reads first column as label (no header) |
| 10. Split | 70% train / 20% test / 10% validation | Separate data for learning, tuning, and final eval |
| 11. Export | 3 CSVs to S3 (train/, test/, validation/) | Ready for SageMaker Training Job in Lab 3 |
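The 70/20/10 split can be sketched with the standard library alone; Data Wrangler's split transform does the equivalent. The 1,000 integer "rows" below are stand-ins for real data:

```python
import random

# Randomized 70/20/10 split over 1,000 fake row indices.
rows = list(range(1000))
rng = random.Random(42)        # fixed seed so the split is reproducible
rng.shuffle(rows)

n = len(rows)
train = rows[: int(n * 0.70)]
test  = rows[int(n * 0.70) : int(n * 0.90)]
val   = rows[int(n * 0.90) :]

print(len(train), len(test), len(val))  # 700 200 100
```

Shuffling before slicing matters: census files are often sorted (by region, by year), and a sorted split would give train and test different distributions.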
HCM Analogy

Imagine preparing an AnyCompany attrition prediction dataset:

| Lab 1 Step | HCM Equivalent |
| --- | --- |
| Strip whitespace | " Hyderabad " → "Hyderabad" (location from HRIS export) |
| Encode target | "Left" → 1, "Stayed" → 0 (binary attrition label) |
| Handle missing | New hires with blank manager_id → impute with "Unassigned" |
| Remove outliers | One-time $500K bonus flagged as outlier in monthly salary column |
| Ordinal encode | job_level: IC1 < IC2 < IC3 < M1 < M2 < VP |
| One-hot encode | department: Engineering, Payroll, HR, Sales (no order) |
| Move target to col 0 | XGBoost expects left_company as first column |

Real-world impact: AnyCompany processes payroll for millions of workers across multiple countries. A single inconsistent string format (e.g., "India" vs " India" vs "IN") can split one country into three categories, causing the model to underperform on that segment. The strip transform prevents this silently.

🎭 Deep Dive: Encode Categorical — Why education, education_num & occupation? (Tasks 3.4 & 3.5)
The Core Problem

ML algorithms do math — they multiply, add, and compare numbers. They cannot process strings like "Bachelors" or "Private". Every categorical column must be converted to numbers before training. The question is how you convert them, because the wrong method introduces false relationships.

Two Encoding Methods
| Method | Output | When to Use | Danger If Misused |
| --- | --- | --- | --- |
| Ordinal | Single column: 0, 1, 2, 3… | Categories have a meaningful rank | Model thinks "3 is better than 1" — wrong if no real order exists |
| One-Hot | N columns, each 0 or 1 | Categories are equal, no ranking | Explodes dimensionality (41 countries = 41 columns) |
Why education Gets Ordinal Encoding

Education levels have a natural hierarchy — a Doctorate represents more education than a high school diploma. The model should know that Doctorate > Masters > Bachelors > HS-grad.

| Original String | Ordinal Value | Interpretation |
| --- | --- | --- |
| Preschool | 0 | Lowest education |
| HS-grad | 4 | High school |
| Some-college | 5 | Partial college |
| Bachelors | 7 | Undergraduate degree |
| Masters | 8 | Graduate degree |
| Doctorate | 9 | Highest education |

Result: The model can now learn “higher education number → higher income probability” as a smooth relationship, which is exactly how income works in reality.

Why education_num Gets Ordinal Encoding

This column is already numeric (values 1–16), but Data Wrangler’s ordinal encoder resets it to start at 0. This is a normalization step — some algorithms perform better when features start at 0 rather than an arbitrary number. It also ensures the encoding is consistent with the education column.

Why occupation Gets Ordinal Encoding

This is the debatable choice in the lab. Occupations like "Exec-managerial", "Prof-specialty", "Handlers-cleaners" don’t have an obvious rank. However:

| Reason | Explanation |
| --- | --- |
| Dimensionality | 14 unique occupations → one-hot creates 14 columns. Ordinal keeps it as 1 column. |
| Income correlation | Occupations do have an implicit income hierarchy (Exec > Sales > Handlers). The ordinal encoder assigns values that XGBoost can split on. |
| Tree-based models | XGBoost uses threshold splits (e.g., "if occupation ≤ 5"). It can learn to group occupations by splitting at different thresholds, even if the ordering isn't perfect. |

Key insight: For tree-based models (XGBoost, Random Forest), ordinal encoding works even when the order isn’t perfect — the tree just learns multiple splits. For linear models (Logistic Regression), this would be wrong because it assumes a linear relationship between the integer and the target.
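The "trees tolerate imperfect ordinal order" claim can be demonstrated with a one-split decision stump. The toy code below (all codes and labels fabricated) searches for the best threshold question "code ≤ t", which is the only kind of question a tree asks of an ordinal column:

```python
# Why tree models tolerate ordinal codes: a tree only asks threshold
# questions ("is code <= t?"), so it can carve category groups apart
# even when the assigned order is rough. Toy one-split stump below.
def best_threshold(codes, labels):
    """Return (t, errors) for the split 'code <= t' with fewest misclassified rows."""
    best = (None, len(labels) + 1)
    for t in sorted(set(codes)):
        left  = [y for c, y in zip(codes, labels) if c <= t]
        right = [y for c, y in zip(codes, labels) if c > t]
        # predict the majority class on each side; count the leftovers
        errors = (min(left.count(0), left.count(1))
                  + min(right.count(0), right.count(1)))
        if errors < best[1]:
            best = (t, errors)
    return best

# Made-up ordinal occupation codes and a target correlated with them.
codes  = [0, 1, 2, 5, 6, 7]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(codes, labels))  # (2, 0): "occupation <= 2" splits perfectly
```

A linear model has no such escape hatch: it multiplies the code by one coefficient, so the integer spacing itself becomes part of the prediction.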

Why sex, race, workclass, marital_status, relationship Get One-Hot Encoding

These columns have no meaningful order. If you assigned integers (Male=0, Female=1), the model would interpret Female as “greater than” Male — which is meaningless. One-hot encoding creates separate binary columns so the model treats each category independently:

| Lab 1 Column | Values | Why One-Hot (not Ordinal) |
| --- | --- | --- |
| marital_status | Married, Divorced, Never-married, Separated, Widowed | Life states, not levels — "Divorced" isn't "more" than "Married" |
| race | White, Black, Asian-Pac-Islander, Amer-Indian, Other | No ranking exists — ordinal would imply one race is "higher" |
| relationship | Husband, Wife, Own-child, Not-in-family, Unmarried | Family roles aren't ordered — "Husband" isn't "more" than "Own-child" |
| sex | Male, Female | Binary but no rank — Male=0/Female=1 implies Female > Male |
| workclass | Private, Self-emp, Federal-gov, Local-gov, State-gov | Employment types aren't ranked — government isn't "more" than private |

What one-hot actually does: Each unique value becomes its own column with 0 or 1. If workclass has 5 values, it becomes 5 new columns: workclass_Private, workclass_Self-emp, workclass_Federal-gov, etc. Only one column is “1” per row — the rest are “0”.

Before & After: What the Data Looks Like
❌ Before Encoding (strings — model can't use)

| education | sex | workclass |
| --- | --- | --- |
| Bachelors | Male | Private |
| HS-grad | Female | Federal-gov |
| Masters | Male | Self-emp |

✅ After Encoding (numbers — model can process)

| education (ordinal) | sex_Male | sex_Female | workclass_Private | workclass_Federal-gov |
| --- | --- | --- | --- | --- |
| 7 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 8 | 1 | 0 | 0 | 0 |
🏢 Mapping to AnyCompany Attrition Dataset (anycompany_attrition_data.csv)

Our synthetic HCM dataset has 21 columns for 2,000 employees. Here’s how each categorical column maps to the same encoding decisions from Lab 1:

| AnyCompany Column | Unique Values | Encoding | Lab 1 Equivalent | Rationale |
| --- | --- | --- | --- | --- |
| department | Engineering, Finance, HR, Legal, Marketing, Operations, Product, Sales (8) | One-Hot | workclass | No ranking — Engineering isn't "more" than HR. Creates 8 binary columns. |
| location | Hyderabad, Pune, Chennai, Bengaluru, London, Roseland NJ, Singapore (7) | One-Hot | race / relationship | No ranking — Hyderabad isn't "higher" than Pune. Creates 7 binary columns. |
| work_mode | Office, Hybrid, Remote (3) | One-Hot | sex | No ranking — Remote isn't "more" than Office. Creates 3 binary columns. |
| badge_color | Blue, Green, Red, White, Yellow (5) | DROP | fnlwgt | Completely random (ROC ~0.5) — no predictive value. Don't encode, just drop. |
| level | IC1, IC2, IC3, IC4, IC5, M1, M2, M3 (8) | Ordinal | education | Clear hierarchy: IC1 < IC2 < IC3 < IC4 < IC5 < M1 < M2 < M3 |
| education | High School, Bachelors, Masters, PhD (4) | Ordinal | education | Same as Lab 1! HS < Bachelors < Masters < PhD |
| exit_interview_scheduled | No, Pending, Yes (3) | DROP | (none) | 🚨 TARGET LEAKAGE! Only "Yes/Pending" for leavers. Would give ROC ~1.0. |
| employee_id | EMP-00001 to EMP-02000 (2000) | DROP | (none) | Identifier, not a feature. One-hot = 2000 columns (overfitting disaster). |
🔍 Before & After: AnyCompany Dataset
❌ Before (raw CSV — 3 sample rows)

| department | level | location | work_mode | education |
| --- | --- | --- | --- | --- |
| HR | IC2 | Hyderabad | Office | Bachelors |
| Engineering | IC4 | London | Remote | Masters |
| Sales | M1 | Pune | Hybrid | PhD |

✅ After Encoding (model-ready)

| dept_HR | dept_Eng | dept_Sales | level (ordinal) | loc_Hyd | loc_Lon | loc_Pun | wm_Office | education (ordinal) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 3 | 0 | 1 | 0 | 0 | 2 |
| 0 | 0 | 1 | 5 | 0 | 0 | 1 | 0 | 3 |

Column explosion: The AnyCompany dataset goes from 21 columns to ~30+ after one-hot encoding. department alone creates 8 new columns. This is why we drop useless columns first (badge_color, employee_id, exit_interview_scheduled) — to avoid encoding columns that add no value.

📊 Why This Matters for XGBoost

XGBoost builds decision trees by asking yes/no questions about features:

| Encoding | How XGBoost Uses It | Example Split |
| --- | --- | --- |
| Ordinal (level) | Threshold split: "Is level ≤ 4?" (separates ICs from Managers) | IC1–IC5 go left, M1–M3 go right → managers leave less |
| One-Hot (dept_Engineering) | Binary split: "Is dept_Engineering = 1?" | Engineers go left, everyone else goes right → engineers leave more |

The decision rule: Ask yourself “Is category A more than category B?”
✅ Yes (IC3 > IC1, Masters > Bachelors, M2 > M1) → Ordinal
❌ No (Engineering ≠ “more than” HR, Hyderabad ≠ “higher than” Pune) → One-Hot

⚠ Common Mistakes
| Mistake | What Happens | Fix |
| --- | --- | --- |
| Ordinal encode department | Model thinks Engineering(0) < HR(1) < Sales(2) — learns false "Sales is highest" | Use one-hot instead |
| One-hot encode job_level | Model loses the hierarchy — treats VP and IC1 as equally different from IC2 | Use ordinal instead |
| One-hot encode high-cardinality column (e.g., employee_id) | Creates 10,000+ sparse columns, model overfits to individual employees | Drop the column entirely |
Deep Dive: Why Amazon EMR & Apache Spark? (Tasks 5–6)
What Problem Does EMR Solve?

In Tasks 1–3, you used SageMaker Data Wrangler — a visual, low-code tool for exploring and transforming data. It works great for prototyping on small datasets. But in production, you need to process millions of records programmatically, with version-controlled code, on a scalable cluster. That’s where Amazon EMR comes in.

The lab teaches both approaches: Data Wrangler for quick visual exploration (Tasks 1–3) and Spark on EMR for scalable, code-first processing (Tasks 5–6). In practice, you prototype in Data Wrangler, then implement the production pipeline in Spark.

The Big Data Stack (3 Layers)
| Layer | Technology | Role | Analogy |
| --- | --- | --- | --- |
| Top | Apache Spark (PySpark) | Processing engine — distributes work across machines, keeps data in memory | The workers who do the actual computation |
| Middle | Apache Hive | SQL catalog — knows where tables live and their schema | The filing system that says "adult_data is in drawer 3" |
| Bottom | Hadoop (YARN + HDFS/S3) | Resource manager + storage — decides which machine runs what | The office building and its room assignments |

Amazon EMR is the AWS managed service that runs this entire stack for you — no server setup, no cluster configuration, auto-scaling built in.

Apache Spark in 30 Seconds

Spark’s key innovation: it keeps data in memory between processing steps (100x faster than Hadoop MapReduce, which writes to disk after every step). You write Python (PySpark), Spark distributes the work across the cluster automatically.

| Concept | What It Means | Lab 1 Example |
| --- | --- | --- |
| DataFrame | Like a Pandas DataFrame, but split across multiple machines | adult_df = sqlContext.sql("select * from adult_data") |
| Transformation | A lazy operation (builds a plan, doesn't execute yet) | StringIndexer, OneHotEncoder, VectorAssembler |
| Action | Triggers actual computation across the cluster | adult_df.count(), .show(), .toPandas() |
| Pipeline | Chain of transformers applied in sequence (reproducible) | Pipeline(stages=indexers + [encoder, assembler]) |
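The transformation-vs-action distinction can be felt in plain Python with generators. This is an analogy only, not Spark itself: building the generator is the lazy "plan", and consuming it is the "action" that makes data flow:

```python
# Spark's lazy-transformation / eager-action split, mimicked with a
# Python generator. Nothing runs until something pulls on the plan.
log = []

def tracked(values):
    for v in values:
        log.append(v)          # record when a row is actually processed
        yield v

plan = (x * 2 for x in tracked([1, 2, 3]))   # "transformation": no work yet
print(log)           # [] -> nothing has executed

result = list(plan)  # "action": triggers the whole pipeline
print(result)        # [2, 4, 6]
print(log)           # [1, 2, 3] -> rows flowed through only now
```

Spark's version of this laziness is what lets it fuse a chain of transformations into one optimized pass over the cluster before any data moves.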
Apache Hive — The SQL Catalog

Hive provides a metadata layer so you can query files using SQL. Instead of writing code to parse CSV files from S3, you register a table once and query it with familiar SQL:

| Without Hive | With Hive |
| --- | --- |
| spark.read.csv("s3://bucket/path/adult.csv", header=True, schema=...) | sqlContext.sql("select * from adult_data") |

In Lab 1, the EMR cluster has the adult dataset pre-registered as a Hive table. That’s why show tables returns adult_data.

What Task 6 Actually Does

Task 6 performs the same feature engineering you did visually in Data Wrangler (Task 3), but now as reproducible code:

| Step | Spark Code | Data Wrangler Equivalent |
| --- | --- | --- |
| Load data | sqlContext.sql("select * from adult_data") | Import CSV into Data Wrangler |
| Explore shape | adult_df.count() → 1000 rows | Data Quality & Insights Report |
| Encode categoricals | StringIndexer + OneHotEncoder | Ordinal / One-Hot encode transforms |
| Assemble features | VectorAssembler → single feature vector | Export step (auto-combines columns) |
| Encode target | StringIndexer(inputCol='income') | Search & Replace (≤50K→1, >50K→0) |
Why Both Approaches in One Lab?
| Approach | Best For | Limitation |
| --- | --- | --- |
| Data Wrangler (visual) | Quick EDA, prototyping transforms, target leakage checks | Doesn't scale to millions of rows; not version-controlled |
| PySpark on EMR (code) | Production pipelines, large datasets, CI/CD integration | Requires coding; slower to iterate during exploration |
HCM Scale Analogy

Why this matters at scale: Imagine processing payroll records for millions of workers across multiple countries. Data Wrangler can handle a 10,000-row sample for exploration. But the production feature engineering pipeline — joining employee records with tax tables, encoding job levels, computing tenure features — needs Spark on EMR (or AWS Glue, which is serverless Spark) to process the full dataset in minutes rather than hours.

From Prototype to Production: Data Wrangler Export

You don’t have to manually rewrite your Data Wrangler transforms as Spark code. Data Wrangler has built-in export options that generate code from your visual flow:

| Export Target | What You Get | When to Use |
| --- | --- | --- |
| SageMaker Processing Job | Python/PySpark script that runs your exact transforms on managed infrastructure | Scheduled batch processing (e.g., nightly payroll feature refresh) |
| SageMaker Pipeline | Your flow as a step in an MLOps pipeline | End-to-end automation (data prep → train → deploy) |
| Python Notebook | Editable .ipynb with all transforms as code | Customize logic, add error handling, integrate with EMR/Glue |
| Feature Store | Pushes transformed features directly to SageMaker Feature Store | Reusable features shared across multiple models |

Real workflow: Prototype transforms visually in Data Wrangler → Export as Python/PySpark code → Customize (add error handling, parameterize S3 paths) → Deploy as a scheduled Spark job on EMR or Glue. You never rewrite from scratch.

Note: Lab 1 doesn’t demonstrate this export path — it has you do both manually to teach the underlying concepts. In a real project, you’d use Data Wrangler’s export as your starting point.

Key Takeaway

You don’t need to become a Spark expert for this course. The important concept is: Data Wrangler = prototype, Spark = production. The lab shows you both so you understand the full workflow from exploration to scalable implementation. And in practice, Data Wrangler bridges the gap by exporting your visual transforms as production-ready code.

SageMaker Data Wrangler · SageMaker Canvas · SageMaker Studio · Amazon EMR · Apache Spark · Amazon S3
Lab 2: Data Processing Using SageMaker Processing and the SageMaker Python SDK
⏱ 15 minutes • Data Processing
Raw CSV in S3 → SageMaker Processing → Processed data in S3
Scenario: Explore an alternative to Data Wrangler — run Spark-based processing scripts programmatically using the SageMaker Python SDK. Same transforms, but serverless and schedulable.

📝 Task Flow

1. Set Up Environment: Launch SageMaker Studio, clone lab2repo, open JupyterLab
2. Run Processing Job: Execute Spark ML processing via SageMaker Processing containers (serverless, no cluster management)
SageMaker Studio · SageMaker Processing · Spark ML Container · Amazon S3
Lab 3: Training a Model with Amazon SageMaker
⏱ 25 minutes • Model Training
Train/Val CSVs → XGBoost Training → Model artifact (.tar.gz)
Scenario: Train an XGBoost binary classifier to predict income ≤$50K. Configure hyperparameters, run a training job on managed infrastructure, evaluate accuracy.

📝 Task Flow

1. Set Up Environment: Launch SageMaker Studio, clone Lab3Repository, select Python 3 kernel
2. Train a Model: Configure XGBoost estimator, run training job, evaluate results, verify model artifacts in S3
SageMaker Studio · SageMaker Training · XGBoost · Amazon ECR · Amazon S3
Lab 4: Model Tuning and Hyperparameter Optimization
⏱ 30 minutes • Model Evaluation & Tuning
Base model → HPO (parallel jobs) → Best model selected
Scenario: Improve model accuracy by running automatic hyperparameter tuning. SageMaker runs multiple training jobs in parallel, each with different hyperparameter combinations, and selects the best performer.

📝 Task Flow

1. Set Up Environment: Launch SageMaker Studio, clone Lab4Repository, select Python 3 kernel
2. Tune a Model: Define hyperparameter ranges, configure tuning job, run optimization, compare model variants
SageMaker Studio · SageMaker Automatic Model Tuning · XGBoost · Amazon S3
Lab 5: Shifting Traffic (Blue/Green Deployment)
⏱ 45 minutes • Model Deployment Strategies
Old model (Blue) → Linear shift (25%→50%→100%) → New model (Green)
Scenario: Deploy a new model using Blue/Green with linear traffic shifting. Gradually move traffic while CloudWatch alarms monitor for errors — auto-rollback if something goes wrong.

📝 Task Flow

1. Set Up Environment: Launch SageMaker Studio, clone Lab5Repository, select Python 3 kernel
2. Linear Traffic Shifting: Create endpoint, configure CloudWatch alarm, implement linear traffic shifting with auto-rollback
SageMaker Endpoints · Blue/Green Deployment · Amazon CloudWatch · Amazon S3
Lab 6: Orchestrate ML Workflow using SageMaker Pipelines and Model Registry
⏱ 1 hour 30 minutes • MLOps & Automation
Raw data → Pipeline (process→train→eval) → Model Registry (versioned)
Scenario: Build an end-to-end automated ML pipeline for customer churn prediction. Orchestrate processing, training, evaluation, and model registration — all triggered with a single API call.

📝 Task Flow

1. Set Up Environment: Launch SageMaker Studio, clone Lab6Repository, select Python 3 kernel
2. Create & Monitor Pipeline: Define pipeline steps (processing, training, evaluation, registration), run pipeline, explore artifacts in Studio
SageMaker Pipelines · SageMaker Model Registry · Amazon S3
Lab 7: Monitor a Model for Data Drift
⏱ 45 minutes • Model Monitoring & Data Quality
Monitor detects drift → Alarm→SNS→Lambda → Auto-retrain & redeploy
Scenario: A production model starts receiving data that looks different from training data (drift). Set up automated monitoring that detects drift and triggers retraining without human intervention.

📝 Task Flow

1. Set Up Environment: Launch SageMaker Studio, clone Lab7Repository, select Python 3 kernel
2. Model Monitoring: Create endpoint with data capture, generate baseline, create Model Monitor job, set up CloudWatch alarms
3. Review Auto-Retraining: CloudWatch alarm → SNS → Lambda → Step Functions (retrain, create model, update endpoint)
SageMaker Model Monitor · SageMaker Endpoints · Amazon CloudWatch · Amazon SNS · AWS Lambda · AWS Step Functions · Amazon S3
🔁
ML Lifecycle Coverage

📊 Data Preparation

Visual data wrangling, Spark processing at scale, feature engineering with encoding techniques

Labs 1 & 2

🧠 Model Training

XGBoost training jobs, algorithm configuration, model artifact management

Lab 3

🎛 Model Tuning

Automatic hyperparameter optimization, parallel training runs, model selection

Lab 4

🚀 Model Deployment

Blue/Green deployment, linear traffic shifting, CloudWatch alarm guardrails

Lab 5

⚙ ML Orchestration

SageMaker Pipelines workflow, Model Registry versioning, automated ML workflows

Lab 6

📡 Model Monitoring

Data drift detection, quality baselines, automated retraining triggers via Step Functions

Lab 7