
🧪 Lab 1: Data Preparation & Wrangling

Analyze, clean, transform, and export demographic data using SageMaker Data Wrangler and Apache Spark on Amazon EMR. Learn the full transformation pipeline from raw CSV to model-ready features.

⏱ 1h 15m 📊 Data Wrangler ⚡ Apache Spark 🎯 Binary Classification 🏢 HCM Mapping
📋
Lab Overview & Scenario

🎯 The Business Problem

A citizen advocacy group needs to predict whether individuals earn ≤$50K/year to target government assistance promotions. You have demographic data (education, employment, marital status) but only partial income labels. Build a binary classification model to identify eligible citizens.

INPUT

📂 Raw CSV: adult_data.csv

32,561 rows · 14 features · 1 target
Strings, missing values, outliers, whitespace issues, useless columns

PROCESS (11 transforms)

🔄 Data Wrangler + Spark on EMR

Drop → Clean → Encode → Split → Export to S3

OUTPUT

🏁 3 Clean CSVs

70% Train · 20% Test · 10% Val
All numeric, encoded, no outliers, target at col 0. Ready for XGBoost (Lab 3).
🧠
What You Learn

🔍 EDA

Data Quality Reports, histograms, target leakage detection via ROC analysis

🧹 Cleaning

Drop columns, handle missing values, remove duplicates, strip whitespace

🔄 Encoding

Ordinal encoding (ranked categories) vs One-Hot encoding (unranked categories)

✂ Outliers

Min-Max threshold removal for extreme values that skew training

✂ Split & Export

70/20/10 randomized split, export to S3 as separate CSVs

⚡ Spark on EMR

Connect SageMaker Studio to EMR cluster, run PySpark EDA at scale

Key Insight: This lab covers the most time-consuming phase of any ML project — data preparation typically consumes 60–80% of total project time. Getting this right determines whether your model succeeds or fails in production.

🛠
AWS Services Used
SageMaker Data Wrangler SageMaker Canvas SageMaker Studio Amazon EMR Apache Spark Amazon S3
🔄
Transformation Pipeline (11 Steps)

The complete sequence of Data Wrangler transforms applied in Lab 1:

1. Drop Columns: fnlwgt, native_country
2. Drop Missing: occupation, workclass = ?
3. Drop Duplicates: remove identical rows
4. Format String: strip left & right spaces
5. Encode Target: ≤50K→1, >50K→0
6. Handle Outliers: capital_gain > 80K
7. Ordinal Encode: education, occupation
8. One-Hot Encode: sex, race, workclass…
9. Move Target to Col 0: XGBoost requirement
10. Split 70/20/10: Train / Test / Validation
11. Export to S3: 3 CSVs → S3 bucket
🎭
Encoding Deep Dive: Ordinal vs One-Hot

ML algorithms do math — they multiply, add, and compare numbers. They cannot process strings like "Bachelors". Every categorical column must be converted to numbers. The question is how.

🔢 Ordinal Encoding

Output: Single column with integers (0, 1, 2, 3…)

Use when: Categories have a meaningful rank

Danger: Model thinks “3 is better than 1” — wrong if no real order exists

📊 One-Hot Encoding

Output: N new columns, each with 0 or 1

Use when: Categories are equal, no ranking

Danger: Explodes dimensionality (41 countries = 41 new columns)

Why These Columns Get Ordinal Encoding
| Column | Values (sample) | Why Ordinal? | Encoded Result |
| --- | --- | --- | --- |
| education | Preschool, HS-grad, Bachelors, Masters, Doctorate | Clear hierarchy — Doctorate > Masters > Bachelors | 0, 4, 7, 8, 9 |
| education_num | 1–16 (already numeric) | Resets to 0-based range for consistency | 0, 1, 2, … 15 |
| occupation | Exec-managerial, Prof-specialty, Handlers-cleaners | 14 values — one-hot = 14 columns. XGBoost can split on ordinal thresholds even with imperfect order | 0, 1, 2, … 13 |

Key insight for tree-based models: XGBoost uses threshold splits (e.g., “if occupation ≤ 5”). It can learn to group occupations by splitting at different thresholds, even if the ordering isn’t perfect. For linear models (Logistic Regression), this would be wrong.
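Outside Data Wrangler, the same transform can be sketched in pandas. This is a minimal sketch with an illustrative, shortened rank order and sample values, not the exact mapping the lab's Data Wrangler flow produces:

```python
import pandas as pd

# Illustrative sample mirroring the education column
df = pd.DataFrame({"education": ["HS-grad", "Bachelors", "Masters", "HS-grad", "Doctorate"]})

# Spell out the rank order explicitly so integers reflect the real hierarchy
order = ["Preschool", "HS-grad", "Bachelors", "Masters", "Doctorate"]
mapping = {cat: i for i, cat in enumerate(order)}

df["education"] = df["education"].map(mapping)
print(df["education"].tolist())  # [1, 2, 3, 1, 4]
```

Defining the order yourself, rather than letting an encoder assign integers alphabetically, is what makes the encoding genuinely ordinal.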

Why These Columns Get One-Hot Encoding

These columns have no meaningful order. Assigning integers would create false relationships.

| Column | Values | Why One-Hot (not Ordinal) | Columns Created |
| --- | --- | --- | --- |
| marital_status | Married, Divorced, Never-married, Separated, Widowed | Life states, not levels — "Divorced" isn't "more" than "Married" | 5 binary columns |
| race | White, Black, Asian-Pac-Islander, Amer-Indian, Other | No ranking — ordinal would imply one race is "higher" | 5 binary columns |
| relationship | Husband, Wife, Own-child, Not-in-family, Unmarried | Family roles aren't ordered | 5 binary columns |
| sex | Male, Female | Binary but no rank — Male=0/Female=1 implies Female > Male | 2 binary columns |
| workclass | Private, Self-emp, Federal-gov, Local-gov, State-gov | Employment types aren't ranked | 5 binary columns |
Before & After: What the Data Looks Like

❌ Before (strings)

| education | sex | workclass |
| --- | --- | --- |
| Bachelors | Male | Private |
| HS-grad | Female | Federal-gov |
| Masters | Male | Self-emp |

✅ After (numbers)

| education | sex_M | sex_F | wc_Priv | wc_Fed |
| --- | --- | --- | --- | --- |
| 7 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 8 | 1 | 0 | 0 | 0 |

The decision rule: Ask “Is category A more than category B?”
Yes (IC3 > IC1, Masters > Bachelors) → Ordinal
No (Engineering ≠ “more than” HR, Hyderabad ≠ “higher than” Pune) → One-Hot
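For unranked columns, the one-hot step can be sketched in pandas with `get_dummies`. The column names below are illustrative samples from the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "Federal-gov", "Self-emp"],
})

# Each category becomes its own 0/1 column; no false ordering is implied
encoded = pd.get_dummies(df, columns=["sex", "workclass"], dtype=int)
print(sorted(encoded.columns))
# ['sex_Female', 'sex_Male', 'workclass_Federal-gov', 'workclass_Private', 'workclass_Self-emp']
```

Note the dimensionality cost: two string columns became five numeric ones, which is exactly why high-cardinality columns (41 countries, 2,000 employee IDs) should not be one-hot encoded.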

🔴
Target Encoding & Data Split

🎯 Why Encode the Target Column?

The income column contains strings: "<=50K" and ">50K". XGBoost (and most ML algorithms) require numeric targets for binary classification. Data Wrangler’s Search and Edit → Find and Replace converts them to 0/1.

| Original Value | Replaced With | Meaning | Why This Assignment? |
| --- | --- | --- | --- |
| <=50K | 1 (positive class) | Eligible for assistance | The business wants to find these people — making them the positive class means recall directly measures success |
| >50K | 0 (negative class) | Not eligible | These are the "default" case the model doesn't need to actively find |

Why recall matters here: The advocacy group has a limited budget. Missing an eligible person (false negative) means they don’t get help. Incorrectly targeting a high-earner (false positive) just wastes one mailer. So recall (catching all eligible people) matters more than precision.
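The equivalent of Data Wrangler's Find and Replace, sketched in pandas:

```python
import pandas as pd

df = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})

# Eligible citizens (<=50K) become the positive class 1
df["income"] = df["income"].map({"<=50K": 1, ">50K": 0})
print(df["income"].tolist())  # [1, 0, 1]
```

One caveat worth checking first: the raw adult data often carries leading spaces (e.g. `" <=50K"`), which is why the whitespace-stripping transform runs before this one.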

Step 9: Move Target to Column 0

⚠ XGBoost’s Silent Requirement

SageMaker’s built-in XGBoost container reads CSV without headers. It assumes column 0 = target label. If you forget to move income to the first position, XGBoost silently trains on the wrong column (e.g., age) and produces a useless model with no error message.

In Data Wrangler: Manage Columns → Move column → Move to start → income
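Outside Data Wrangler, the same reordering is a one-liner in pandas (column names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [39, 50], "education": [7, 4], "income": [1, 0]})

# Put the label first: SageMaker XGBoost assumes column 0 is the target
cols = ["income"] + [c for c in df.columns if c != "income"]
df = df[cols]
print(list(df.columns))  # ['income', 'age', 'education']
```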

Step 10: Train/Validation/Test Split (70/20/10)

📘 Training (70%)

Model learns patterns from this data. Sees it multiple times during training iterations.

📙 Validation (10%)

Tune hyperparameters and early stopping. Model never trains on this but uses it to check progress.

📗 Test (20%)

Final unbiased evaluation. Never seen during training or tuning. Simulates real-world performance.

Why randomized split? The adult dataset is sorted by certain features. A sequential split (first 70% for train) would give the model biased data. Random split ensures each subset has similar distributions of income, education, age, etc.
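A 70/20/10 randomized split can be sketched with two chained scikit-learn calls; the synthetic frame below is just a placeholder for the cleaned dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.integers(0, 2, 1000),
                   "age": rng.integers(20, 60, 1000)})

# First peel off 70% for training, then split the remaining 30% into 20%/10%
train, rest = train_test_split(df, test_size=0.30, random_state=42, shuffle=True)
test, val = train_test_split(rest, test_size=1/3, random_state=42)
print(len(train), len(test), len(val))  # 700 200 100
```

`shuffle=True` (the default) is what protects against the sorted-data bias described above; a `stratify=` argument on the target would additionally guarantee matching class proportions in each subset.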

Step 11: Export to S3
| Export | S3 Path | Format | Used In |
| --- | --- | --- | --- |
| Training set | s3://bucket/scripts/data/train/ | CSV, no header, 1 partition | Lab 3 (XGBoost training) |
| Test set | s3://bucket/scripts/data/test/ | CSV, no header, 1 partition | Lab 3 (model evaluation) |
| Validation set | s3://bucket/scripts/data/validation/ | CSV, no header, 1 partition | Lab 4 (hyperparameter tuning) |
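The headerless export can be sketched in pandas; the S3 upload is shown commented out with a placeholder bucket name, since the lab's actual bucket differs per account:

```python
import pandas as pd

train = pd.DataFrame({"income": [1, 0], "age": [39, 50]})

# No header row and no index column: SageMaker XGBoost's CSV input expects
# pure data rows with the target in column 0
train.to_csv("train.csv", index=False, header=False)

# The upload step would then be something like (bucket is a placeholder):
# import boto3
# boto3.client("s3").upload_file("train.csv", "my-bucket", "scripts/data/train/train.csv")
```

Forgetting `header=False` is a classic failure mode: XGBoost would silently treat the header strings as a data row.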
🏢
HCM Mapping: AnyCompany Attrition Dataset

Every Lab 1 technique maps directly to AnyCompany’s anycompany_attrition_data.csv — 2,000 synthetic employee records predicting who will leave the company.

📊 Lab 1 Dataset (Census)

Target: income (≤50K or >50K)
Rows: 32,561
Features: 14 (age, education, occupation, etc.)
Goal: Predict income eligibility

🏢 AnyCompany Dataset (HCM)

Target: left_company (1=left, 0=stayed)
Rows: 2,000
Features: 19 (department, level, engagement, etc.)
Goal: Predict employee attrition

Column-by-Column Encoding Decisions
| AnyCompany Column | Values | Encoding | Lab 1 Equivalent | Rationale |
| --- | --- | --- | --- | --- |
| department | Engineering, Finance, HR, Legal, Marketing, Operations, Product, Sales (8) | One-Hot | workclass | No ranking — Engineering isn't "more" than HR. Creates 8 binary columns. |
| location | Hyderabad, Pune, Chennai, Bengaluru, London, Roseland NJ, Singapore (7) | One-Hot | race | No ranking — Hyderabad isn't "higher" than Pune. Creates 7 binary columns. |
| work_mode | Office, Hybrid, Remote (3) | One-Hot | sex | No ranking — Remote isn't "more" than Office. Creates 3 binary columns. |
| level | IC1, IC2, IC3, IC4, IC5, M1, M2, M3 (8) | Ordinal | education | Clear hierarchy: IC1 < IC2 < … < IC5 < M1 < M2 < M3. Encodes to 0–7. |
| education | High School, Bachelors, Masters, PhD (4) | Ordinal | education | Same as Lab 1! HS < Bachelors < Masters < PhD. Encodes to 0–3. |
| badge_color | Blue, Green, Red, White, Yellow (5) | DROP | fnlwgt | Completely random (ROC ~0.5). No predictive value — just noise. |
| exit_interview_scheduled | No, Pending, Yes (3) | DROP | | 🚨 TARGET LEAKAGE! Only "Yes/Pending" for leavers. ROC ~1.0. |
| resignation_notice_days | 0–30 (numeric) | DROP | | 🚨 TARGET LEAKAGE! Only non-zero for leavers. ROC ~1.0. |
| employee_id | EMP-00001 to EMP-02000 | DROP | | Identifier, not a feature. One-hot = 2000 columns (overfitting). |
Numeric Columns (No Encoding Needed)
| AnyCompany Column | Type | Range | Lab 1 Equivalent | Notes |
| --- | --- | --- | --- | --- |
| age | Integer | 22–60 | age | Already numeric — use as-is |
| tenure_months | Integer | 1–240 | hours_per_week | Already numeric — use as-is |
| salary_percentile | Integer | 10–95 | capital_gain | Already numeric. Check for outliers. |
| performance_score | Float | 1.0–5.0 | | Already numeric — use as-is |
| engagement_score | Float | 1.0–10.0 | | Already numeric. Strong predictor (ROC ~0.7). |
| months_since_promotion | Integer | 0–60 | | Already numeric. Strong predictor (ROC ~0.7). |
| hours_per_week | Integer | 35–70 | hours_per_week | Already numeric. Check for outliers > 60. |
| commute_km | Integer | 1–80 | | Already numeric — use as-is |
| training_hours_last_yr | Integer | 0–100 | | Already numeric — use as-is |
| direct_reports | Integer | 0–15 | | Already numeric — use as-is |
| manager_changes_2yr | Integer | 0–4 | | Already numeric. Moderate predictor. |
Before & After: AnyCompany Data

❌ Raw CSV (strings + leakage)

| dept | level | loc | work_mode | exit_int | left |
| --- | --- | --- | --- | --- | --- |
| HR | IC2 | Hyderabad | Office | Yes | 1 |
| Eng | IC4 | London | Remote | No | 0 |
| Sales | M1 | Pune | Hybrid | No | 0 |

✅ After Encoding (model-ready)

| left | dept_HR | dept_Eng | dept_Sales | level | loc_Hyd | wm_Off |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 3 | 0 | 0 |
| 0 | 0 | 0 | 1 | 5 | 0 | 0 |

Column explosion: The AnyCompany dataset goes from 21 raw columns to ~30+ after one-hot encoding. department alone creates 8 new columns, location adds 7 more. This is why we drop useless columns first (badge_color, employee_id, exit_interview_scheduled, resignation_notice_days) — to avoid encoding columns that add no value.

Full Transformation Pipeline for AnyCompany
  1. Drop Columns: Remove employee_id, badge_color, exit_interview_scheduled, resignation_notice_days (identifiers + leakage + noise)
  2. Handle Missing: Check for nulls in engagement_score, performance_score. Impute with median or drop rows.
  3. Strip Whitespace: Format String → Strip left/right on department, location, work_mode, education, level
  4. Handle Outliers: Remove hours_per_week > 65 (likely data entry errors), commute_km > 70
  5. Ordinal Encode: level → IC1=0, IC2=1, IC3=2, IC4=3, IC5=4, M1=5, M2=6, M3=7; education → High School=0, Bachelors=1, Masters=2, PhD=3
  6. One-Hot Encode: department (8 cols), location (7 cols), work_mode (3 cols) — total 18 new binary columns
  7. Move Target to Column 0: Move left_company to first position (XGBoost requirement)
  8. Split & Export: 70/20/10 randomized split → 3 CSVs to S3 (train, test, validation)

Instructor demo tip: Upload anycompany_attrition_data.csv to Data Wrangler during the lab walkthrough. Run the Target Leakage report — participants will see exit_interview_scheduled at ROC ~1.0 (leakage!) and badge_color at ROC ~0.5 (useless). This makes the concept visceral.
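The report's per-feature ROC check can be approximated with scikit-learn: score each candidate feature alone against the target with `roc_auc_score`. The data below uses synthetic stand-ins for the leaky and useless columns, not the actual lab dataset:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
left = rng.integers(0, 2, 500)  # synthetic target: 1 = left the company

# A leaky feature tracks the target almost perfectly; a noise feature doesn't
leaky = left + rng.normal(0, 0.01, 500)  # stand-in for resignation_notice_days
noise = rng.normal(0, 1, 500)            # stand-in for badge_color

print(roc_auc_score(left, leaky))  # ~1.0 -> target leakage, drop it
print(roc_auc_score(left, noise))  # ~0.5 -> no signal, drop it
```

The interpretation mirrors the Data Wrangler report: AUC near 1.0 means the feature effectively *is* the label (leakage), while AUC near 0.5 means it carries no usable signal.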

Spark on EMR: Processing at Scale

Tasks 4–6 connect SageMaker Studio to an Amazon EMR cluster for large-scale data processing using Apache Spark. This is the “big data” alternative to Data Wrangler’s visual interface.

🖥 Data Wrangler (Tasks 1–3)

Best for: Visual exploration, quick transforms, small-medium datasets (<10GB)

Interface: GUI — drag, click, configure

Scale: Single processing job, limited parallelism

⚡ Spark on EMR (Tasks 4–6)

Best for: Programmatic transforms, large datasets (10GB–PB), complex joins

Interface: Code — PySpark in Jupyter notebooks

Scale: Distributed across cluster nodes, massive parallelism

Connection Flow
  1. Launch SageMaker Studio: Open JupyterLab workspace, clone the lab repository
  2. Select SparkMagic PySpark Kernel: Switch from Python 3 to SparkMagic — this kernel routes code to the EMR cluster instead of running locally
  3. Discover & Connect to EMR Cluster: Use the "Connect to cluster" dialog in Studio. Select the EMR cluster and authenticate (No-Auth for lab simplicity)
  4. Run PySpark EDA: Execute Spark SQL queries, compute statistics, join datasets — all distributed across the cluster

HCM Use Case: When You Need Spark
| Scenario | Data Size | Why Spark? |
| --- | --- | --- |
| Monthly payroll anomaly detection | ~50M transactions/month across 42M workers | Data Wrangler can't handle 50M rows. Spark distributes across 20+ nodes. |
| Cross-country compliance joins | Tax rules × employees × jurisdictions = billions of combinations | Complex multi-table joins need a distributed SQL engine. |
| Historical attrition feature engineering | 10 years of HR data, 500M records | Window functions (rolling averages, lag features) over massive time series. |
| Real-time fraud feature pipeline | Streaming payroll events at 100K/sec | Spark Structured Streaming for real-time feature computation. |

AnyCompany context: For the 2,000-row attrition dataset, Data Wrangler is perfect. But when AnyCompany scales to their full workforce (millions of employees across 140+ countries), the same transforms need Spark on EMR or SageMaker Processing with Spark containers (Lab 2).

Key Spark Concepts for ML Engineers

📦 DataFrame

Spark’s equivalent of a pandas DataFrame, but distributed across cluster nodes. Same column/row mental model, but operations run in parallel.

🔄 Lazy Evaluation

Spark doesn’t execute transforms immediately. It builds a plan (DAG) and only runs when you call an action like .show() or .count(). This allows optimization.

💾 Partitions

Data is split into chunks (partitions) distributed across nodes. More partitions = more parallelism, but also more overhead. Spark's default for shuffle operations is 200 partitions.

🔌 SparkMagic

A Jupyter kernel that sends code to a remote Spark cluster via Livy REST API. You write PySpark in a notebook, but it executes on EMR — not your local machine.