Your roadmap through 7 labs covering the full ML lifecycle on AWS — from raw data to production monitoring
Target leakage occurs when a feature in your training data is strongly correlated with the target label but would not be available at prediction time in the real world. It is essentially “cheating” — the model gets information that leaks the answer, scores perfectly in training, but fails in production.
SageMaker Data Wrangler computes the ROC score for each feature column individually via cross-validation. The ROC score tells you how well that single feature alone can predict the target:
| ROC Score | Meaning | Action |
|---|---|---|
| 0.5 | Zero predictive power (coin flip) | Drop — adds noise without helping |
| 0.6 – 0.8 | Normal predictive power | Keep — good features |
| 0.9 – 1.0 | Suspiciously high | Investigate — possible leakage! |
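Under the hood, this per-feature screen is just a single-column AUC: how often a randomly chosen positive example has a higher feature value than a randomly chosen negative one. A self-contained sketch with hypothetical toy data (Data Wrangler additionally cross-validates; this version skips that):

```python
def single_feature_auc(feature, target):
    """AUC of one feature used alone as a predictor: the probability that a
    random positive example has a higher feature value than a random
    negative one (ties count half)."""
    pos = [x for x, y in zip(feature, target) if y == 1]
    neg = [x for x, y in zip(feature, target) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: binary income label plus three candidate features.
target = [1, 1, 1, 0, 0, 0]
leaky = [9, 8, 7, 1, 2, 3]    # separates the classes perfectly -> AUC 1.0
useful = [5, 7, 4, 3, 6, 2]   # mostly separates -> AUC ~0.78
noise = [1, 2, 3, 1, 2, 3]    # identical distribution per class -> AUC 0.5

for name, col in [("leaky", leaky), ("useful", useful), ("noise", noise)]:
    auc = single_feature_auc(col, target)
    flag = "INVESTIGATE" if auc >= 0.9 else ("DROP" if auc <= 0.55 else "keep")
    print(f"{name}: AUC={auc:.2f} -> {flag}")
```

The thresholds (0.9 to investigate, ~0.5 to drop) mirror the table above; they are screening heuristics, not hard rules.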
The adult_data.csv has 14 features predicting income (≤$50K or >$50K). The target leakage report reveals:
| Feature | ROC | Verdict |
|---|---|---|
| education_num | ~0.7 | ✅ Good predictor — keep |
| marital_status | ~0.7 | ✅ Good predictor — keep |
| hours_per_week | ~0.65 | ✅ Moderate — keep |
| fnlwgt | ~0.5 | ❌ Useless — drop |
| native_country | ~0.5 | ❌ Useless — drop |
Result: No target leakage detected (no feature near 1.0). But fnlwgt (census sampling weight) and native_country are dropped because ROC = 0.5 means they contribute nothing — just noise that slows training.
Prevents false confidence: A leaked feature gives 99% accuracy in training but 50% in production. Target leakage analysis catches this before you waste weeks on a model that will fail when deployed.
Removes noise: Features with ROC = 0.5 add computational cost without improving predictions. Dropping them makes training faster and can improve accuracy.
Imagine building an attrition prediction model (will this employee leave?):
| Feature | ROC | Problem |
|---|---|---|
| exit_interview_date | ~1.0 | 🚨 Leakage! If populated, they already left |
| badge_color | ~0.5 | ❌ Useless — no correlation |
| months_since_promotion | ~0.7 | ✅ Genuinely predictive — keep |
ML algorithms expect clean, numeric, consistently formatted data. Raw datasets from production systems contain messy strings, missing values, outliers, and mixed formats. Task 3 applies a transformation pipeline that converts raw census data into model-ready features.
The complete sequence of transforms applied in Data Wrangler, in order:
The income column contains string labels: "<=50K" and ">50K". XGBoost (and most ML algorithms) require numeric targets for binary classification. Data Wrangler’s Search and Edit → Find and Replace converts them:
| Original Value | Replaced With | Meaning |
|---|---|---|
| <=50K | 1 | Positive class (eligible for assistance) |
| >50K | 0 | Negative class (not eligible) |
Why this encoding? The advocacy group wants to find people who earn ≤$50K. Making that the positive class (1) means the model’s recall metric directly measures “what % of eligible people did we correctly identify?” — the metric the business cares about most.
XGBoost requirement: The target column must also be the first column in the CSV (Step 9). XGBoost’s built-in SageMaker container reads column 0 as the label by default — no header row, no column name mapping. If you forget this step, the model trains on the wrong column silently.
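Both steps can be sketched in pandas — a minimal, hypothetical three-row frame standing in for adult_data.csv:

```python
import io
import pandas as pd

# Hypothetical mini-frame standing in for the full adult dataset.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "education": ["Bachelors", "Bachelors", "HS-grad"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# Step 5 equivalent: map the string labels to the business-aligned encoding
# (<=50K is the positive class the advocacy group wants to find).
df["income"] = df["income"].map({"<=50K": 1, ">50K": 0})

# Step 9 equivalent: move the target to column 0 for the built-in XGBoost
# container, then write CSV with no header row.
df = df[["income"] + [c for c in df.columns if c != "income"]]
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
print(buf.getvalue())
```

Note the `header=False`: the built-in container reads raw CSV positionally, so a leftover header row would be parsed as a data row.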
The adult dataset has categorical columns with inconsistent whitespace — values like " Private", "Private ", and " Private " would be treated as three different categories without cleaning.
| Operation | What It Does | Example |
|---|---|---|
| Strip left/right | Remove leading & trailing whitespace | " Private " → "Private" |
| Applied to | education, income, marital_status, occupation, race, relationship, sex, workclass | |
Why it matters: Without stripping, one-hot encoding creates duplicate columns for " Private" vs "Private" — inflating dimensionality and confusing the model. Also, the target encoding (Step 5) would fail: " <=50K" wouldn’t match the pattern "<=50K", leaving string values in the target column and crashing training.
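A quick pandas sketch of the same strip transform, using hypothetical values as they might arrive from a raw export:

```python
import pandas as pd

# Three whitespace variants of "Private" plus one clean value.
df = pd.DataFrame({"workclass": [" Private", "Private ", " Private ", "State-gov"]})
print(df["workclass"].nunique())  # 4 "different" categories before cleaning

# Strip leading and trailing whitespace, as in Data Wrangler's string transform.
df["workclass"] = df["workclass"].str.strip()
print(df["workclass"].nunique())  # collapses to the 2 real categories
```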
| Step | Action | Rationale |
|---|---|---|
| 1. Drop columns | Remove fnlwgt, native_country | ROC = 0.5 (zero predictive power, just noise) |
| 2. Drop missing | Remove rows where occupation or workclass = "?" | ~6% of rows; random pattern, safe to drop |
| 3. Drop duplicates | Remove identical rows | Prevents model from memorizing repeated examples |
The capital_gain column has extreme values (e.g., 99,999) that represent rare stock sales. These outliers skew gradient calculations during training. Data Wrangler removes values > 80,000 or < 0 using Min-Max numeric outliers with the Remove fix method.
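The Min-Max/Remove behavior amounts to a range filter. A sketch with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({"capital_gain": [0, 5000, 99999, 14000, -20, 99999]})

# Min-Max numeric outliers with the Remove fix: keep only rows whose value
# falls inside [0, 80000]; everything outside is dropped, not clipped.
lower, upper = 0, 80_000
cleaned = df[df["capital_gain"].between(lower, upper)]
print(len(df), "->", len(cleaned))
```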
| Encoding | When to Use | Lab 1 Columns |
|---|---|---|
| Ordinal | Natural order exists | education (HS < Bachelors < Masters < PhD), education_num, occupation |
| One-Hot | No inherent order | marital_status, race, relationship, sex, workclass |
Trap: One-hot encoding native_country (41 unique values) would create 41 sparse columns. Since ROC = 0.5 (useless), we dropped it in Step 1 before encoding — saving 41 dimensions and speeding up training.
| Step | Action | Why |
|---|---|---|
| 9. Move target | Move income to column 0 | XGBoost reads first column as label (no header) |
| 10. Split | 70% train / 20% test / 10% validation | Separate data for learning, tuning, and final eval |
| 11. Export | 3 CSVs to S3 (train/, test/, validation/) | Ready for SageMaker Training Job in Lab 3 |
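The randomized 70/20/10 split can be sketched in plain Python (row indices stand in for the prepared dataset; the seed is arbitrary but fixed for reproducibility):

```python
import random

rows = list(range(1000))   # stand-in for the prepared dataset's rows
random.seed(42)            # fixed seed so the split is reproducible
random.shuffle(rows)

n = len(rows)
train = rows[: int(0.7 * n)]              # 70% — model learning
test = rows[int(0.7 * n): int(0.9 * n)]   # 20% — held-out evaluation
validation = rows[int(0.9 * n):]          # 10% — tuning during training

print(len(train), len(test), len(validation))  # 700 200 100
```

Shuffling before slicing matters: slicing a sorted file would put systematically different rows in each split.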
Imagine preparing an AnyCompany attrition prediction dataset:
| Lab 1 Step | HCM Equivalent |
|---|---|
| Strip whitespace | " Hyderabad " → "Hyderabad" (location from HRIS export) |
| Encode target | "Left" → 1, "Stayed" → 0 (binary attrition label) |
| Handle missing | New hires with blank manager_id → impute with “Unassigned” |
| Remove outliers | One-time $500K bonus flagged as outlier in monthly salary column |
| Ordinal encode | job_level: IC1 < IC2 < IC3 < M1 < M2 < VP |
| One-hot encode | department: Engineering, Payroll, HR, Sales (no order) |
| Move target to col 0 | XGBoost expects left_company as first column |
Real-world impact: AnyCompany processes payroll for millions of workers across multiple countries. A single inconsistent string format (e.g., "India" vs " India" vs "IN") can silently split one country into three categories, causing the model to underperform on that segment. The strip transform fixes the whitespace variants; an abbreviation mismatch like "IN" still needs an explicit find-and-replace.
ML algorithms do math — they multiply, add, and compare numbers. They cannot process strings like "Bachelors" or "Private". Every categorical column must be converted to numbers before training. The question is how you convert them, because the wrong method introduces false relationships.
| Method | Output | When to Use | Danger If Misused |
|---|---|---|---|
| Ordinal | Single column: 0, 1, 2, 3… | Categories have a meaningful rank | Model thinks “3 is better than 1” — wrong if no real order exists |
| One-Hot | N columns, each 0 or 1 | Categories are equal, no ranking | Explodes dimensionality (41 countries = 41 columns) |
education Gets Ordinal Encoding

Education levels have a natural hierarchy — a Doctorate represents more education than a high school diploma. The model should know that Doctorate > Masters > Bachelors > HS-grad.
| Original String | Ordinal Value | Interpretation |
|---|---|---|
| Preschool | 0 | Lowest education |
| HS-grad | 4 | High school |
| Some-college | 5 | Partial college |
| Bachelors | 7 | Undergraduate degree |
| Masters | 8 | Graduate degree |
| Doctorate | 9 | Highest education |
Result: The model can now learn “higher education number → higher income probability” as a smooth relationship, which is exactly how income works in reality.
education_num Gets Ordinal Encoding

This column is already numeric (values 1–16), but Data Wrangler’s ordinal encoder resets it to start at 0. This is a normalization step — some algorithms perform better when features start at 0 rather than an arbitrary number. It also ensures the encoding is consistent with the education column.
occupation Gets Ordinal Encoding

This is the debatable choice in the lab. Occupations like "Exec-managerial", "Prof-specialty", "Handlers-cleaners" don’t have an obvious rank. However:
| Reason | Explanation |
|---|---|
| Dimensionality | 14 unique occupations → one-hot creates 14 columns. Ordinal keeps it as 1 column. |
| Income correlation | Occupations do have an implicit income hierarchy (Exec > Sales > Handlers). The ordinal encoder assigns values that XGBoost can split on. |
| Tree-based models | XGBoost uses threshold splits (e.g., “if occupation ≤ 5”). It can learn to group occupations by splitting at different thresholds, even if the ordering isn’t perfect. |
Key insight: For tree-based models (XGBoost, Random Forest), ordinal encoding works even when the order isn’t perfect — the tree just learns multiple splits. For linear models (Logistic Regression), this would be wrong because it assumes a linear relationship between the integer and the target.
sex, race, workclass, marital_status, relationship Get One-Hot Encoding

These columns have no meaningful order. If you assigned integers (Male=0, Female=1), the model would interpret Female as “greater than” Male — which is meaningless. One-hot encoding creates separate binary columns so the model treats each category independently:
| Lab 1 Column | Values | Why One-Hot (not Ordinal) |
|---|---|---|
| marital_status | Married, Divorced, Never-married, Separated, Widowed | Life states, not levels — “Divorced” isn’t “more” than “Married” |
| race | White, Black, Asian-Pac-Islander, Amer-Indian, Other | No ranking exists — ordinal would imply one race is “higher” |
| relationship | Husband, Wife, Own-child, Not-in-family, Unmarried | Family roles aren’t ordered — “Husband” isn’t “more” than “Own-child” |
| sex | Male, Female | Binary but no rank — Male=0/Female=1 implies Female > Male |
| workclass | Private, Self-emp, Federal-gov, Local-gov, State-gov | Employment types aren’t ranked — government isn’t “more” than private |
What one-hot actually does: Each unique value becomes its own column with 0 or 1. If workclass has 5 values, it becomes 5 new columns: workclass_Private, workclass_Self-emp, workclass_Federal-gov, etc. Only one column is “1” per row — the rest are “0”.
| ❌ Before Encoding (strings — model can’t use) | ||
|---|---|---|
| education | sex | workclass |
| Bachelors | Male | Private |
| HS-grad | Female | Federal-gov |
| Masters | Male | Self-emp |
| ✅ After Encoding (numbers — model can process) | ||||
|---|---|---|---|---|
| education (ordinal) | sex_Male (one-hot) | sex_Female (one-hot) | workclass_Private (one-hot) | workclass_Federal-gov (one-hot) |
| 7 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 8 | 1 | 0 | 0 | 0 |
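The before/after tables can be reproduced with a short pandas sketch (hypothetical three-row frame; the ordinal values mirror the education table above):

```python
import pandas as pd

raw = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters"],
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "Federal-gov", "Self-emp"],
})

# Ordinal: an explicit rank map — order is the whole point of this encoding.
edu_rank = {"HS-grad": 4, "Bachelors": 7, "Masters": 8}
encoded = raw.copy()
encoded["education"] = encoded["education"].map(edu_rank)

# One-hot: each unique value becomes its own 0/1 column, exactly one "1" per row.
encoded = pd.get_dummies(encoded, columns=["sex", "workclass"], dtype=int)
print(encoded)
```

`get_dummies` names the new columns `sex_Male`, `workclass_Private`, and so on — the same shape as the “After Encoding” table.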
anycompany_attrition_data.csv

Our synthetic HCM dataset has 21 columns for 2,000 employees. Here’s how each categorical column maps to the same encoding decisions from Lab 1:
| AnyCompany Column | Unique Values | Encoding | Lab 1 Equivalent | Rationale |
|---|---|---|---|---|
| department | Engineering, Finance, HR, Legal, Marketing, Operations, Product, Sales (8) | One-Hot | workclass | No ranking — Engineering isn’t “more” than HR. Creates 8 binary columns. |
| location | Hyderabad, Pune, Chennai, Bengaluru, London, Roseland NJ, Singapore (7) | One-Hot | race / relationship | No ranking — Hyderabad isn’t “higher” than Pune. Creates 7 binary columns. |
| work_mode | Office, Hybrid, Remote (3) | One-Hot | sex | No ranking — Remote isn’t “more” than Office. Creates 3 binary columns. |
| badge_color | Blue, Green, Red, White, Yellow (5) | DROP | fnlwgt | Completely random (ROC ~0.5) — no predictive value. Don’t encode, just drop. |
| level | IC1, IC2, IC3, IC4, IC5, M1, M2, M3 (8) | Ordinal | education | Clear hierarchy: IC1 < IC2 < IC3 < IC4 < IC5 < M1 < M2 < M3 |
| education | High School, Bachelors, Masters, PhD (4) | Ordinal | education | Same as Lab 1! HS < Bachelors < Masters < PhD |
| exit_interview_scheduled | No, Pending, Yes (3) | DROP | — | 🚨 TARGET LEAKAGE! Only “Yes/Pending” for leavers. Would give ROC ~1.0. |
| employee_id | EMP-00001 to EMP-02000 (2000) | DROP | — | Identifier, not a feature. One-hot = 2000 columns (overfitting disaster). |
| ❌ Before (raw CSV — 3 sample rows) | ||||
|---|---|---|---|---|
| department | level | location | work_mode | education |
| HR | IC2 | Hyderabad | Office | Bachelors |
| Engineering | IC4 | London | Remote | Masters |
| Sales | M1 | Pune | Hybrid | PhD |
| ✅ After Encoding (model-ready) | ||||||||
|---|---|---|---|---|---|---|---|---|
| dept_HR | dept_Eng | dept_Sales | level (ordinal) | loc_Hyd | loc_Lon | loc_Pune | wm_Office | education (ordinal) |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 3 | 0 | 1 | 0 | 0 | 2 |
| 0 | 0 | 1 | 5 | 0 | 0 | 1 | 0 | 3 |
Column explosion: The AnyCompany dataset goes from 21 columns to ~30+ after one-hot encoding. department alone creates 8 new columns. This is why we drop useless columns first (badge_color, employee_id, exit_interview_scheduled) — to avoid encoding columns that add no value.
XGBoost builds decision trees by asking yes/no questions about features:
| Encoding | How XGBoost Uses It | Example Split |
|---|---|---|
| Ordinal (level) | Threshold split: “Is level ≤ 4?” (separates ICs from Managers) | IC1-IC5 go left, M1-M3 go right → managers leave less |
| One-Hot (dept_Engineering) | Binary split: “Is dept_Engineering = 1?” | Engineers go left, everyone else goes right → engineers leave more |
The decision rule: Ask yourself “Is category A more than category B?”
✅ Yes (IC3 > IC1, Masters > Bachelors, M2 > M1) → Ordinal
❌ No (Engineering ≠ “more than” HR, Hyderabad ≠ “higher than” Pune) → One-Hot
| Mistake | What Happens | Fix |
|---|---|---|
| Ordinal encode department | Model thinks Engineering(0) < HR(1) < Sales(2) — learns false “Sales is highest” | Use one-hot instead |
| One-hot encode job_level | Model loses the hierarchy — treats VP and IC1 as equally different from IC2 | Use ordinal instead |
| One-hot encode high-cardinality column (e.g., employee_id) | Creates 10,000+ sparse columns, model overfits to individual employees | Drop the column entirely |
In Tasks 1–3, you used SageMaker Data Wrangler — a visual, low-code tool for exploring and transforming data. It works great for prototyping on small datasets. But in production, you need to process millions of records programmatically, with version-controlled code, on a scalable cluster. That’s where Amazon EMR comes in.
The lab teaches both approaches: Data Wrangler for quick visual exploration (Tasks 1–3) and Spark on EMR for scalable, code-first processing (Tasks 5–6). In practice, you prototype in Data Wrangler, then implement the production pipeline in Spark.
| Layer | Technology | Role | Analogy |
|---|---|---|---|
| Top | Apache Spark (PySpark) | Processing engine — distributes work across machines, keeps data in memory | The workers who do the actual computation |
| Middle | Apache Hive | SQL catalog — knows where tables live and their schema | The filing system that says “adult_data is in drawer 3” |
| Bottom | Hadoop (YARN + HDFS/S3) | Resource manager + storage — decides which machine runs what | The office building and its room assignments |
Amazon EMR is the AWS managed service that runs this entire stack for you — no server setup, no cluster configuration, auto-scaling built in.
Spark’s key innovation: it keeps data in memory between processing steps (up to 100x faster than Hadoop MapReduce, which writes to disk after every step). You write Python (PySpark), and Spark distributes the work across the cluster automatically.
| Concept | What It Means | Lab 1 Example |
|---|---|---|
| DataFrame | Like a Pandas DataFrame, but split across multiple machines | adult_df = sqlContext.sql("select * from adult_data") |
| Transformation | A lazy operation (builds a plan, doesn’t execute yet) | StringIndexer, OneHotEncoder, VectorAssembler |
| Action | Triggers actual computation across the cluster | adult_df.count(), .show(), .toPandas() |
| Pipeline | Chain of transformers applied in sequence (reproducible) | Pipeline(stages=indexers + [encoder, assembler]) |
Hive provides a metadata layer so you can query files using SQL. Instead of writing code to parse CSV files from S3, you register a table once and query it with familiar SQL:
| Without Hive | With Hive |
|---|---|
| spark.read.csv("s3://bucket/path/adult.csv", header=True, schema=...) | sqlContext.sql("select * from adult_data") |
In Lab 1, the EMR cluster has the adult dataset pre-registered as a Hive table. That’s why show tables returns adult_data.
Task 6 performs the same feature engineering you did visually in Data Wrangler (Task 3), but now as reproducible code:
| Step | Spark Code | Data Wrangler Equivalent |
|---|---|---|
| Load data | sqlContext.sql("select * from adult_data") | Import CSV into Data Wrangler |
| Explore shape | adult_df.count() → 1000 rows | Data Quality & Insights Report |
| Encode categoricals | StringIndexer + OneHotEncoder | Ordinal / One-Hot encode transforms |
| Assemble features | VectorAssembler → single feature vector | Export step (auto-combines columns) |
| Encode target | StringIndexer(inputCol='income') | Search & Replace (≤50K→1, >50K→0) |
| Approach | Best For | Limitation |
|---|---|---|
| Data Wrangler (visual) | Quick EDA, prototyping transforms, target leakage checks | Doesn’t scale to millions of rows; not version-controlled |
| PySpark on EMR (code) | Production pipelines, large datasets, CI/CD integration | Requires coding; slower to iterate during exploration |
Why this matters at scale: Imagine processing payroll records for millions of workers across multiple countries. Data Wrangler can handle a 10,000-row sample for exploration. But the production feature engineering pipeline — joining employee records with tax tables, encoding job levels, computing tenure features — needs Spark on EMR (or AWS Glue, which is serverless Spark) to process the full dataset in minutes rather than hours.
You don’t have to manually rewrite your Data Wrangler transforms as Spark code. Data Wrangler has built-in export options that generate code from your visual flow:
| Export Target | What You Get | When to Use |
|---|---|---|
| SageMaker Processing Job | Python/PySpark script that runs your exact transforms on managed infrastructure | Scheduled batch processing (e.g., nightly payroll feature refresh) |
| SageMaker Pipeline | Your flow as a step in an MLOps pipeline | End-to-end automation (data prep → train → deploy) |
| Python Notebook | Editable .ipynb with all transforms as code | Customize logic, add error handling, integrate with EMR/Glue |
| Feature Store | Pushes transformed features directly to SageMaker Feature Store | Reusable features shared across multiple models |
Real workflow: Prototype transforms visually in Data Wrangler → Export as Python/PySpark code → Customize (add error handling, parameterize S3 paths) → Deploy as a scheduled Spark job on EMR or Glue. You never rewrite from scratch.
Note: Lab 1 doesn’t demonstrate this export path — it has you do both manually to teach the underlying concepts. In a real project, you’d use Data Wrangler’s export as your starting point.
You don’t need to become a Spark expert for this course. The important concept is: Data Wrangler = prototype, Spark = production. The lab shows you both so you understand the full workflow from exploration to scalable implementation. And in practice, Data Wrangler bridges the gap by exporting your visual transforms as production-ready code.
- Visual data wrangling, Spark processing at scale, feature engineering with encoding techniques
- XGBoost training jobs, algorithm configuration, model artifact management
- Automatic hyperparameter optimization, parallel training runs, model selection
- Blue/Green deployment, linear traffic shifting, CloudWatch alarm guardrails
- SageMaker Pipelines workflow, Model Registry versioning, automated ML workflows
- Data drift detection, quality baselines, automated retraining triggers via Step Functions