Analyze, clean, transform, and export demographic data using SageMaker Data Wrangler and Apache Spark on Amazon EMR. Learn the full transformation pipeline from raw CSV to model-ready features.
A citizen advocacy group needs to predict whether individuals earn ≤$50K/year to target government assistance promotions. You have demographic data (education, employment, marital status) but only partial income labels. Build a binary classification model to identify eligible citizens.
adult_data.csv
+ Spark on EMR
Data Quality Reports, histograms, target leakage detection via ROC analysis
Drop columns, handle missing values, remove duplicates, strip whitespace
Ordinal encoding (ranked categories) vs One-Hot encoding (unranked categories)
Min-Max threshold removal for extreme values that skew training
70/20/10 randomized split, export to S3 as separate CSVs
Connect SageMaker Studio to EMR cluster, run PySpark EDA at scale
Key Insight: This lab covers the most time-consuming phase of any ML project — data preparation typically consumes 60–80% of total project time. Getting this right determines whether your model succeeds or fails in production.
The complete sequence of Data Wrangler transforms applied in Lab 1. Click any step to see details below.
ML algorithms do math — they multiply, add, and compare numbers. They cannot process strings like "Bachelors". Every categorical column must be converted to numbers. The question is how.
Output: Single column with integers (0, 1, 2, 3…)
Use when: Categories have a meaningful rank
Danger: Model thinks “3 is better than 1” — wrong if no real order exists
Output: N new columns, each with 0 or 1
Use when: Categories are equal, no ranking
Danger: Explodes dimensionality (41 countries = 41 new columns)
| Column | Values (sample) | Why Ordinal? | Encoded Result |
|---|---|---|---|
education | Preschool, HS-grad, Bachelors, Masters, Doctorate | Clear hierarchy — Doctorate > Masters > Bachelors | 0, 4, 7, 8, 9 |
education_num | 1–16 (already numeric) | Resets to 0-based range for consistency | 0, 1, 2, … 15 |
occupation | Exec-managerial, Prof-specialty, Handlers-cleaners | 14 values — one-hot = 14 columns. XGBoost can split on ordinal thresholds even with imperfect order | 0, 1, 2, … 13 |
Key insight for tree-based models: XGBoost uses threshold splits (e.g., “if occupation ≤ 5”). It can learn to group occupations by splitting at different thresholds, even if the ordering isn’t perfect. For linear models (Logistic Regression), this would be wrong.
These columns have no meaningful order. Assigning integers would create false relationships.
| Column | Values | Why One-Hot (not Ordinal) | Columns Created |
|---|---|---|---|
marital_status | Married, Divorced, Never-married, Separated, Widowed | Life states, not levels — “Divorced” isn’t “more” than “Married” | 5 binary columns |
race | White, Black, Asian-Pac-Islander, Amer-Indian, Other | No ranking — ordinal would imply one race is “higher” | 5 binary columns |
relationship | Husband, Wife, Own-child, Not-in-family, Unmarried | Family roles aren’t ordered | 5 binary columns |
sex | Male, Female | Binary but no rank — Male=0/Female=1 implies Female > Male | 2 binary columns |
workclass | Private, Self-emp, Federal-gov, Local-gov, State-gov | Employment types aren’t ranked | 5 binary columns |
| education | sex | workclass |
|---|---|---|
| Bachelors | Male | Private |
| HS-grad | Female | Federal-gov |
| Masters | Male | Self-emp |
| education | sex_M | sex_F | wc_Priv | wc_Fed |
|---|---|---|---|---|
| 7 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 8 | 1 | 0 | 0 | 0 |
The decision rule: Ask “Is category A more than category B?”
✅ Yes (IC3 > IC1, Masters > Bachelors) → Ordinal
❌ No (Engineering ≠ “more than” HR, Hyderabad ≠ “higher than” Pune) → One-Hot
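As a sketch, the decision rule plays out like this in pandas (toy rows mirroring the before/after tables above; the education mapping is illustrative, not Data Wrangler's internal code):

```python
import pandas as pd

# Sample rows matching the "before" table (illustrative data only)
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters"],
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "Federal-gov", "Self-emp"],
})

# Ordinal: ranked categories map to integers via an explicit order we choose
education_rank = {"Preschool": 0, "HS-grad": 4, "Bachelors": 7, "Masters": 8, "Doctorate": 9}
df["education"] = df["education"].map(education_rank)

# One-hot: unranked categories expand into 0/1 indicator columns
df = pd.get_dummies(df, columns=["sex", "workclass"], dtype=int)

print(df.columns.tolist())
```

Note how one column per category appears (sex becomes 2 columns, workclass 3 here) — the dimensionality explosion warned about above.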
The income column contains strings: "<=50K" and ">50K". XGBoost (and most ML algorithms) require numeric targets for binary classification. Data Wrangler’s Search and Edit → Find and Replace converts them to 0/1.
| Original Value | Replaced With | Meaning | Why This Assignment? |
|---|---|---|---|
<=50K | 1 (positive class) | Eligible for assistance | The business wants to find these people — making them the positive class means recall directly measures success |
>50K | 0 (negative class) | Not eligible | These are the “default” case the model doesn’t need to actively find |
Why recall matters here: The advocacy group has a limited budget. Missing an eligible person (false negative) means they don’t get help. Incorrectly targeting a high-earner (false positive) just wastes one mailer. So recall (catching all eligible people) matters more than precision.
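The same replacement can be sketched in pandas (the equivalent of the Find and Replace step, with the class assignment from the table):

```python
import pandas as pd

# Raw target strings as they appear in adult_data.csv
income = pd.Series(["<=50K", ">50K", "<=50K", ">50K"])

# "<=50K" is the positive class (eligible for assistance), so it maps to 1
income = income.map({"<=50K": 1, ">50K": 0})

print(income.tolist())  # → [1, 0, 1, 0]
```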
SageMaker’s built-in XGBoost container reads CSV without headers. It assumes column 0 = target label. If you forget to move income to the first position, XGBoost silently trains on the wrong column (e.g., age) and produces a useless model with no error message.
In Data Wrangler: Manage Columns → Move column → Move to start → income
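The code-level equivalent of that move is a one-line column reorder (a pandas sketch with hypothetical column names):

```python
import pandas as pd

# Hypothetical frame: income must end up in column 0 for SageMaker's XGBoost CSV format
df = pd.DataFrame({"age": [39, 50], "education": [7, 4], "income": [1, 0]})

# Reorder so the target label comes first, followed by the remaining features
df = df[["income"] + [c for c in df.columns if c != "income"]]

print(df.columns.tolist())  # → ['income', 'age', 'education']
```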
Model learns patterns from this data. Sees it multiple times during training iterations.
Used for hyperparameter tuning and early stopping. The model never trains on this data but uses it to check progress.
Final unbiased evaluation. Never seen during training or tuning. Simulates real-world performance.
Why randomized split? The adult dataset is sorted by certain features. A sequential split (first 70% for train) would give the model biased data. Random split ensures each subset has similar distributions of income, education, age, etc.
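A minimal sketch of the randomized 70/20/10 split (pandas on toy data; Data Wrangler does this internally):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the prepared dataset
df = pd.DataFrame({
    "income": np.random.randint(0, 2, 1000),
    "age": np.random.randint(18, 90, 1000),
})

# Shuffle first so each subset gets a similar distribution, then slice 70/20/10
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.7 * n)]
test = shuffled.iloc[int(0.7 * n) : int(0.9 * n)]
validation = shuffled.iloc[int(0.9 * n) :]

print(len(train), len(test), len(validation))  # → 700 200 100
```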
| Export | S3 Path | Format | Used In |
|---|---|---|---|
| Training set | s3://bucket/scripts/data/train/ | CSV, no header, 1 partition | Lab 3 (XGBoost training) |
| Test set | s3://bucket/scripts/data/test/ | CSV, no header, 1 partition | Lab 3 (model evaluation) |
| Validation set | s3://bucket/scripts/data/validation/ | CSV, no header, 1 partition | Lab 4 (hyperparameter tuning) |
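The "CSV, no header" requirement is easy to get wrong outside Data Wrangler; a sketch of a compliant export (writing locally here — in the lab the file lands under the S3 paths above):

```python
from pathlib import Path

import pandas as pd

# Target already in column 0, per the XGBoost container requirement
train = pd.DataFrame({"income": [1, 0], "age": [39, 50]})

# SageMaker's built-in XGBoost expects CSV with no header row and no index column
out = Path("train.csv")
train.to_csv(out, header=False, index=False)

print(out.read_text())
```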
Every Lab 1 technique maps directly to AnyCompany’s anycompany_attrition_data.csv — 2,000 synthetic employee records predicting who will leave the company.
Target: income (≤50K or >50K)
Rows: 32,561
Features: 14 (age, education, occupation, etc.)
Goal: Predict income eligibility
Target: left_company (1=left, 0=stayed)
Rows: 2,000
Features: 19 (department, level, engagement, etc.)
Goal: Predict employee attrition
| AnyCompany Column | Values | Encoding | Lab 1 Equivalent | Rationale |
|---|---|---|---|---|
department | Engineering, Finance, HR, Legal, Marketing, Operations, Product, Sales (8) | One-Hot | workclass | No ranking — Engineering isn’t “more” than HR. Creates 8 binary columns. |
location | Hyderabad, Pune, Chennai, Bengaluru, London, Roseland NJ, Singapore (7) | One-Hot | race | No ranking — Hyderabad isn’t “higher” than Pune. Creates 7 binary columns. |
work_mode | Office, Hybrid, Remote (3) | One-Hot | sex | No ranking — Remote isn’t “more” than Office. Creates 3 binary columns. |
level | IC1, IC2, IC3, IC4, IC5, M1, M2, M3 (8) | Ordinal | education | Clear hierarchy: IC1 < IC2 < … < IC5 < M1 < M2 < M3. Encodes to 0–7. |
education | High School, Bachelors, Masters, PhD (4) | Ordinal | education | Same as Lab 1! HS < Bachelors < Masters < PhD. Encodes to 0–3. |
badge_color | Blue, Green, Red, White, Yellow (5) | DROP | fnlwgt | Completely random (ROC ~0.5). No predictive value — just noise. |
exit_interview_scheduled | No, Pending, Yes (3) | DROP | — | 🚨 TARGET LEAKAGE! Only “Yes/Pending” for leavers. ROC ~1.0. |
resignation_notice_days | 0–30 (numeric) | DROP | — | 🚨 TARGET LEAKAGE! Only non-zero for leavers. ROC ~1.0. |
employee_id | EMP-00001 to EMP-02000 | DROP | — | Identifier, not a feature. One-hot = 2000 columns (overfitting). |
| AnyCompany Column | Type | Range | Lab 1 Equivalent | Notes |
|---|---|---|---|---|
age | Integer | 22–60 | age | Already numeric — use as-is |
tenure_months | Integer | 1–240 | hours_per_week | Already numeric — use as-is |
salary_percentile | Integer | 10–95 | capital_gain | Already numeric. Check for outliers. |
performance_score | Float | 1.0–5.0 | — | Already numeric — use as-is |
engagement_score | Float | 1.0–10.0 | — | Already numeric. Strong predictor (ROC ~0.7). |
months_since_promotion | Integer | 0–60 | — | Already numeric. Strong predictor (ROC ~0.7). |
hours_per_week | Integer | 35–70 | hours_per_week | Already numeric. Check for outliers > 60. |
commute_km | Integer | 1–80 | — | Already numeric — use as-is |
training_hours_last_yr | Integer | 0–100 | — | Already numeric — use as-is |
direct_reports | Integer | 0–15 | — | Already numeric — use as-is |
manager_changes_2yr | Integer | 0–4 | — | Already numeric. Moderate predictor. |
| dept | level | loc | work_mode | exit_int | left |
|---|---|---|---|---|---|
| HR | IC2 | Hyderabad | Office | Yes | 1 |
| Eng | IC4 | London | Remote | No | 0 |
| Sales | M1 | Pune | Hybrid | No | 0 |
| left | dept_HR | dept_Eng | dept_Sales | level | loc_Hyd | wm_Off |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 3 | 0 | 0 |
| 0 | 0 | 0 | 1 | 5 | 0 | 0 |
Column explosion: The AnyCompany dataset goes from 21 raw columns to ~30+ after one-hot encoding. department alone creates 8 new columns, location adds 7 more. This is why we drop useless columns first (badge_color, employee_id, exit_interview_scheduled, resignation_notice_days) — to avoid encoding columns that add no value.
Remove employee_id, badge_color, exit_interview_scheduled, resignation_notice_days (identifiers + leakage + noise)
Check for nulls in engagement_score, performance_score. Impute with median or drop rows.
Format String → Strip left/right on department, location, work_mode, education, level
Remove hours_per_week > 65 (likely data entry errors), commute_km > 70
level → IC1=0, IC2=1, IC3=2, IC4=3, IC5=4, M1=5, M2=6, M3=7
education → High School=0, Bachelors=1, Masters=2, PhD=3
department (8 cols), location (7 cols), work_mode (3 cols) — total 18 new binary columns
Move left_company to first position (XGBoost requirement)
70/20/10 randomized split → 3 CSVs to S3 (train, test, validation)
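The whole AnyCompany recipe above can be sketched end to end in pandas (three synthetic rows; column names from the tables, thresholds and mappings as stated in the steps — this mirrors the Data Wrangler flow, it is not its implementation):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for anycompany_attrition_data.csv rows
df = pd.DataFrame({
    "employee_id": ["EMP-00001", "EMP-00002", "EMP-00003"],
    "badge_color": ["Blue", "Red", "Green"],
    "exit_interview_scheduled": ["Yes", "No", "No"],
    "resignation_notice_days": [14, 0, 0],
    "department": [" HR", "Engineering ", "Sales"],
    "level": ["IC2", "IC4", "M1"],
    "engagement_score": [3.1, np.nan, 8.2],
    "hours_per_week": [40, 72, 45],
    "left_company": [1, 0, 0],
})

# 1. Drop identifiers, leakage columns, and noise
df = df.drop(columns=["employee_id", "badge_color",
                      "exit_interview_scheduled", "resignation_notice_days"])

# 2. Impute missing engagement scores with the median
df["engagement_score"] = df["engagement_score"].fillna(df["engagement_score"].median())

# 3. Strip stray whitespace from string columns
df["department"] = df["department"].str.strip()

# 4. Remove outliers (hours_per_week > 65 treated as entry errors)
df = df[df["hours_per_week"] <= 65]

# 5. Ordinal-encode level; 6. one-hot encode department
level_order = {"IC1": 0, "IC2": 1, "IC3": 2, "IC4": 3, "IC5": 4, "M1": 5, "M2": 6, "M3": 7}
df["level"] = df["level"].map(level_order)
df = pd.get_dummies(df, columns=["department"], dtype=int)

# 7. Move the target to the first position (XGBoost requirement)
df = df[["left_company"] + [c for c in df.columns if c != "left_company"]]

print(df.columns.tolist())
```

Step 8 (the 70/20/10 split and S3 export) then proceeds exactly as in the adult-dataset pipeline.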
Instructor demo tip: Upload anycompany_attrition_data.csv to Data Wrangler during the lab walkthrough. Run the Target Leakage report — participants will see exit_interview_scheduled at ROC ~1.0 (leakage!) and badge_color at ROC ~0.5 (useless). This makes the concept visceral.
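What that report measures can be approximated with a per-feature ROC AUC on synthetic data — a rank-based AUC sketch, not Data Wrangler's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(feature, target):
    """Rank-based AUC: probability a random positive scores above a random negative."""
    order = feature.argsort()
    ranks = np.empty(len(feature))
    ranks[order] = np.arange(1, len(feature) + 1)
    pos = target == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

left = rng.integers(0, 2, 2000)            # target: who left
leaky = left * rng.integers(1, 31, 2000)   # resignation_notice_days: non-zero only for leavers
noise = rng.integers(0, 5, 2000)           # badge_color as random integers

print(round(auc(leaky, left), 2), round(auc(noise, left), 2))  # ~1.0 vs ~0.5
```

The leaky feature separates classes perfectly (AUC ≈ 1.0 means "drop it — it won't exist at prediction time"), while the random one hovers at 0.5 ("drop it — it carries no signal").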
Tasks 4–6 connect SageMaker Studio to an Amazon EMR cluster for large-scale data processing using Apache Spark. This is the “big data” alternative to Data Wrangler’s visual interface.
Best for: Visual exploration, quick transforms, small-medium datasets (<10GB)
Interface: GUI — drag, click, configure
Scale: Single processing job, limited parallelism
Best for: Programmatic transforms, large datasets (10GB–PB), complex joins
Interface: Code — PySpark in Jupyter notebooks
Scale: Distributed across cluster nodes, massive parallelism
Open JupyterLab workspace, clone the lab repository
Switch from Python 3 to SparkMagic — this kernel routes code to the EMR cluster instead of running locally
Use the “Connect to cluster” dialog in Studio. Select the EMR cluster and authenticate (No-Auth for lab simplicity)
Execute Spark SQL queries, compute statistics, join datasets — all distributed across the cluster
| Scenario | Data Size | Why Spark? |
|---|---|---|
| Monthly payroll anomaly detection | ~50M transactions/month across 42M workers | Data Wrangler can’t handle 50M rows. Spark distributes across 20+ nodes. |
| Cross-country compliance joins | Tax rules × employees × jurisdictions = billions of combinations | Complex multi-table joins need distributed SQL engine. |
| Historical attrition feature engineering | 10 years of HR data, 500M records | Window functions (rolling averages, lag features) over massive time series. |
| Real-time fraud feature pipeline | Streaming payroll events at 100K/sec | Spark Structured Streaming for real-time feature computation. |
AnyCompany context: For the 2,000-row attrition dataset, Data Wrangler is perfect. But when AnyCompany scales to their full workforce (millions of employees across 140+ countries), the same transforms need Spark on EMR or SageMaker Processing with Spark containers (Lab 2).
Spark’s equivalent of a pandas DataFrame, but distributed across cluster nodes. Same column/row mental model, but operations run in parallel.
Spark doesn’t execute transforms immediately. It builds a plan (DAG) and only runs when you call an action like .show() or .count(). This allows optimization.
Data is split into chunks (partitions) distributed across nodes. More partitions = more parallelism, but also more overhead. Default for shuffle operations: 200 partitions (spark.sql.shuffle.partitions).
A Jupyter kernel that sends code to a remote Spark cluster via Livy REST API. You write PySpark in a notebook, but it executes on EMR — not your local machine.