Analyze, clean, transform, and export demographic data using SageMaker Data Wrangler and Apache Spark on Amazon EMR. Learn the full transformation pipeline from raw CSV to model-ready features.
A citizen advocacy group needs to predict whether individuals earn ≤$50K/year to target government assistance promotions. You have demographic data (education, employment, marital status) but only partial income labels. Build a binary classification model to identify eligible citizens.
adult_data.csv
+ Spark on EMR
Data Quality Reports, histograms, target leakage detection via ROC analysis
Drop columns, handle missing values, remove duplicates, strip whitespace
Ordinal encoding (ranked categories) vs One-Hot encoding (unranked categories)
Min-Max threshold removal for extreme values that skew training
70/20/10 randomized split, export to S3 as separate CSVs
Connect SageMaker Studio to EMR cluster, run PySpark EDA at scale
Key Insight: This lab covers the most time-consuming phase of any ML project — data preparation typically consumes 60–80% of total project time. Getting this right determines whether your model succeeds or fails in production.
The complete sequence of Data Wrangler transforms applied in Lab 1. Click any step to see details below.
ML algorithms do math — they multiply, add, and compare numbers. They cannot process strings like "Bachelors". Every categorical column must be converted to numbers. The question is how.
Output: Single column with integers (0, 1, 2, 3…)
Use when: Categories have a meaningful rank
Danger: Model thinks “3 is better than 1” — wrong if no real order exists
Output: N new columns, each with 0 or 1
Use when: Categories are equal, no ranking
Danger: Explodes dimensionality (41 countries = 41 new columns)
| Column | Values (sample) | Why Ordinal? | Encoded Result |
|---|---|---|---|
education | Preschool, HS-grad, Bachelors, Masters, Doctorate | Clear hierarchy — Doctorate > Masters > Bachelors | 0, 4, 7, 8, 9 |
education_num | 1–16 (already numeric) | Resets to 0-based range for consistency | 0, 1, 2, … 15 |
occupation | Exec-managerial, Prof-specialty, Handlers-cleaners | 14 values — one-hot = 14 columns. XGBoost can split on ordinal thresholds even with imperfect order | 0, 1, 2, … 13 |
Key insight for tree-based models: XGBoost uses threshold splits (e.g., “if occupation ≤ 5”). It can learn to group occupations by splitting at different thresholds, even if the ordering isn’t perfect. For linear models (Logistic Regression), this would be wrong.
These columns have no meaningful order. Assigning integers would create false relationships.
| Column | Values | Why One-Hot (not Ordinal) | Columns Created |
|---|---|---|---|
marital_status | Married, Divorced, Never-married, Separated, Widowed | Life states, not levels — “Divorced” isn’t “more” than “Married” | 5 binary columns |
race | White, Black, Asian-Pac-Islander, Amer-Indian, Other | No ranking — ordinal would imply one race is “higher” | 5 binary columns |
relationship | Husband, Wife, Own-child, Not-in-family, Unmarried | Family roles aren’t ordered | 5 binary columns |
sex | Male, Female | Binary but no rank — Male=0/Female=1 implies Female > Male | 2 binary columns |
workclass | Private, Self-emp, Federal-gov, Local-gov, State-gov | Employment types aren’t ranked | 5 binary columns |
| education | sex | workclass |
|---|---|---|
| Bachelors | Male | Private |
| HS-grad | Female | Federal-gov |
| Masters | Male | Self-emp |
| education | sex_M | sex_F | wc_Priv | wc_Fed |
|---|---|---|---|---|
| 7 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 8 | 1 | 0 | 0 | 0 |
The decision rule: Ask “Is category A more than category B?”
✅ Yes (IC3 > IC1, Masters > Bachelors) → Ordinal
❌ No (Engineering ≠ “more than” HR, Hyderabad ≠ “higher than” Pune) → One-Hot
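As a sketch, the decision rule plays out like this in pandas (toy rows mirroring the before/after tables above; the education mapping is illustrative, not Data Wrangler's internal code):

```python
import pandas as pd

# Sample rows matching the "before" table (illustrative data only)
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters"],
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "Federal-gov", "Self-emp"],
})

# Ordinal: ranked categories map to integers via an explicit order we choose
education_rank = {"Preschool": 0, "HS-grad": 4, "Bachelors": 7, "Masters": 8, "Doctorate": 9}
df["education"] = df["education"].map(education_rank)

# One-hot: unranked categories expand into 0/1 indicator columns
df = pd.get_dummies(df, columns=["sex", "workclass"], dtype=int)

print(df.columns.tolist())
```

Note how one column per category appears (sex becomes 2 columns, workclass 3 here) — the dimensionality explosion warned about above.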
The income column contains strings: "<=50K" and ">50K". XGBoost (and most ML algorithms) require numeric targets for binary classification. Data Wrangler’s Search and Edit → Find and Replace converts them to 0/1.
| Original Value | Replaced With | Meaning | Why This Assignment? |
|---|---|---|---|
<=50K | 1 (positive class) | Eligible for assistance | The business wants to find these people — making them the positive class means recall directly measures success |
>50K | 0 (negative class) | Not eligible | These are the “default” case the model doesn’t need to actively find |
Why recall matters here: The advocacy group has a limited budget. Missing an eligible person (false negative) means they don’t get help. Incorrectly targeting a high-earner (false positive) just wastes one mailer. So recall (catching all eligible people) matters more than precision.
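The same replacement can be sketched in pandas (the equivalent of the Find and Replace step, with the class assignment from the table):

```python
import pandas as pd

# Raw target strings as they appear in adult_data.csv
income = pd.Series(["<=50K", ">50K", "<=50K", ">50K"])

# "<=50K" is the positive class (eligible for assistance), so it maps to 1
income = income.map({"<=50K": 1, ">50K": 0})

print(income.tolist())  # → [1, 0, 1, 0]
```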
SageMaker’s built-in XGBoost container reads CSV without headers. It assumes column 0 = target label. If you forget to move income to the first position, XGBoost silently trains on the wrong column (e.g., age) and produces a useless model with no error message.
In Data Wrangler: Manage Columns → Move column → Move to start → income
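The code-level equivalent of that move is a one-line column reorder (a pandas sketch with hypothetical column names):

```python
import pandas as pd

# Hypothetical frame: income must end up in column 0 for SageMaker's XGBoost CSV format
df = pd.DataFrame({"age": [39, 50], "education": [7, 4], "income": [1, 0]})

# Reorder so the target label comes first, followed by the remaining features
df = df[["income"] + [c for c in df.columns if c != "income"]]

print(df.columns.tolist())  # → ['income', 'age', 'education']
```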
Model learns patterns from this data. Sees it multiple times during training iterations.
Used for hyperparameter tuning and early stopping. The model never trains on this data but uses it to check progress.
Final unbiased evaluation. Never seen during training or tuning. Simulates real-world performance.
Why randomized split? The adult dataset is sorted by certain features. A sequential split (first 70% for train) would give the model biased data. Random split ensures each subset has similar distributions of income, education, age, etc.
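A minimal sketch of the randomized 70/20/10 split (pandas on toy data; Data Wrangler does this internally):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the prepared dataset
df = pd.DataFrame({
    "income": np.random.randint(0, 2, 1000),
    "age": np.random.randint(18, 90, 1000),
})

# Shuffle first so each subset gets a similar distribution, then slice 70/20/10
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.7 * n)]
test = shuffled.iloc[int(0.7 * n) : int(0.9 * n)]
validation = shuffled.iloc[int(0.9 * n) :]

print(len(train), len(test), len(validation))  # → 700 200 100
```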
| Export | S3 Path | Format | Used In |
|---|---|---|---|
| Training set | s3://bucket/scripts/data/train/ | CSV, no header, 1 partition | Lab 3 (XGBoost training) |
| Test set | s3://bucket/scripts/data/test/ | CSV, no header, 1 partition | Lab 3 (model evaluation) |
| Validation set | s3://bucket/scripts/data/validation/ | CSV, no header, 1 partition | Lab 4 (hyperparameter tuning) |
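The "CSV, no header" requirement is easy to get wrong outside Data Wrangler; a sketch of a compliant export (writing locally here — in the lab the file lands under the S3 paths above):

```python
from pathlib import Path

import pandas as pd

# Target already in column 0, per the XGBoost container requirement
train = pd.DataFrame({"income": [1, 0], "age": [39, 50]})

# SageMaker's built-in XGBoost expects CSV with no header row and no index column
out = Path("train.csv")
train.to_csv(out, header=False, index=False)

print(out.read_text())
```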
Every Lab 1 technique maps directly to AnyCompany’s anycompany_attrition_data.csv — 2,000 synthetic employee records predicting who will leave the company.
Target: income (≤50K or >50K)
Rows: 32,561
Features: 14 (age, education, occupation, etc.)
Goal: Predict income eligibility
Target: left_company (1=left, 0=stayed)
Rows: 2,000
Features: 19 (department, level, engagement, etc.)
Goal: Predict employee attrition
| AnyCompany Column | Values | Encoding | Lab 1 Equivalent | Rationale |
|---|---|---|---|---|
department | Engineering, Finance, HR, Legal, Marketing, Operations, Product, Sales (8) | One-Hot | workclass | No ranking — Engineering isn’t “more” than HR. Creates 8 binary columns. |
location | Hyderabad, Pune, Chennai, Bengaluru, London, Roseland NJ, Singapore (7) | One-Hot | race | No ranking — Hyderabad isn’t “higher” than Pune. Creates 7 binary columns. |
work_mode | Office, Hybrid, Remote (3) | One-Hot | sex | No ranking — Remote isn’t “more” than Office. Creates 3 binary columns. |
level | IC1, IC2, IC3, IC4, IC5, M1, M2, M3 (8) | Ordinal | education | Clear hierarchy: IC1 < IC2 < … < IC5 < M1 < M2 < M3. Encodes to 0–7. |
education | High School, Bachelors, Masters, PhD (4) | Ordinal | education | Same as Lab 1! HS < Bachelors < Masters < PhD. Encodes to 0–3. |
badge_color | Blue, Green, Red, White, Yellow (5) | DROP | fnlwgt | Completely random (ROC ~0.5). No predictive value — just noise. |
exit_interview_scheduled | No, Pending, Yes (3) | DROP | — | 🚨 TARGET LEAKAGE! Only “Yes/Pending” for leavers. ROC ~1.0. |
resignation_notice_days | 0–30 (numeric) | DROP | — | 🚨 TARGET LEAKAGE! Only non-zero for leavers. ROC ~1.0. |
employee_id | EMP-00001 to EMP-02000 | DROP | — | Identifier, not a feature. One-hot = 2000 columns (overfitting). |
| AnyCompany Column | Type | Range | Lab 1 Equivalent | Notes |
|---|---|---|---|---|
age | Integer | 22–60 | age | Already numeric — use as-is |
tenure_months | Integer | 1–240 | hours_per_week | Already numeric — use as-is |
salary_percentile | Integer | 10–95 | capital_gain | Already numeric. Check for outliers. |
performance_score | Float | 1.0–5.0 | — | Already numeric — use as-is |
engagement_score | Float | 1.0–10.0 | — | Already numeric. Strong predictor (ROC ~0.7). |
months_since_promotion | Integer | 0–60 | — | Already numeric. Strong predictor (ROC ~0.7). |
hours_per_week | Integer | 35–70 | hours_per_week | Already numeric. Check for outliers > 60. |
commute_km | Integer | 1–80 | — | Already numeric — use as-is |
training_hours_last_yr | Integer | 0–100 | — | Already numeric — use as-is |
direct_reports | Integer | 0–15 | — | Already numeric — use as-is |
manager_changes_2yr | Integer | 0–4 | — | Already numeric. Moderate predictor. |
| dept | level | loc | work_mode | exit_int | left |
|---|---|---|---|---|---|
| HR | IC2 | Hyderabad | Office | Yes | 1 |
| Eng | IC4 | London | Remote | No | 0 |
| Sales | M1 | Pune | Hybrid | No | 0 |
| left | dept_HR | dept_Eng | dept_Sales | level | loc_Hyd | wm_Off |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 3 | 0 | 0 |
| 0 | 0 | 0 | 1 | 5 | 0 | 0 |
Column explosion: The AnyCompany dataset goes from 21 raw columns to ~30+ after one-hot encoding. department alone creates 8 new columns, location adds 7 more. This is why we drop useless columns first (badge_color, employee_id, exit_interview_scheduled, resignation_notice_days) — to avoid encoding columns that add no value.
Remove employee_id, badge_color, exit_interview_scheduled, resignation_notice_days (identifiers + leakage + noise)
Check for nulls in engagement_score, performance_score. Impute with median or drop rows.
Format String → Strip left/right on department, location, work_mode, education, level
Remove hours_per_week > 65 (likely data entry errors), commute_km > 70
level → IC1=0, IC2=1, IC3=2, IC4=3, IC5=4, M1=5, M2=6, M3=7
education → High School=0, Bachelors=1, Masters=2, PhD=3
department (8 cols), location (7 cols), work_mode (3 cols) — total 18 new binary columns
Move left_company to first position (XGBoost requirement)
70/20/10 randomized split → 3 CSVs to S3 (train, test, validation)
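The whole AnyCompany recipe above can be sketched end to end in pandas (three synthetic rows; column names from the tables, thresholds and mappings as stated in the steps — this mirrors the Data Wrangler flow, it is not its implementation):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for anycompany_attrition_data.csv rows
df = pd.DataFrame({
    "employee_id": ["EMP-00001", "EMP-00002", "EMP-00003"],
    "badge_color": ["Blue", "Red", "Green"],
    "exit_interview_scheduled": ["Yes", "No", "No"],
    "resignation_notice_days": [14, 0, 0],
    "department": [" HR", "Engineering ", "Sales"],
    "level": ["IC2", "IC4", "M1"],
    "engagement_score": [3.1, np.nan, 8.2],
    "hours_per_week": [40, 72, 45],
    "left_company": [1, 0, 0],
})

# 1. Drop identifiers, leakage columns, and noise
df = df.drop(columns=["employee_id", "badge_color",
                      "exit_interview_scheduled", "resignation_notice_days"])

# 2. Impute missing engagement scores with the median
df["engagement_score"] = df["engagement_score"].fillna(df["engagement_score"].median())

# 3. Strip stray whitespace from string columns
df["department"] = df["department"].str.strip()

# 4. Remove outliers (hours_per_week > 65 treated as entry errors)
df = df[df["hours_per_week"] <= 65]

# 5. Ordinal-encode level; 6. one-hot encode department
level_order = {"IC1": 0, "IC2": 1, "IC3": 2, "IC4": 3, "IC5": 4, "M1": 5, "M2": 6, "M3": 7}
df["level"] = df["level"].map(level_order)
df = pd.get_dummies(df, columns=["department"], dtype=int)

# 7. Move the target to the first position (XGBoost requirement)
df = df[["left_company"] + [c for c in df.columns if c != "left_company"]]

print(df.columns.tolist())
```

Step 8 (the 70/20/10 split and S3 export) then proceeds exactly as in the adult-dataset pipeline.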
Instructor demo tip: Upload anycompany_attrition_data.csv to Data Wrangler during the lab walkthrough. Run the Target Leakage report — participants will see exit_interview_scheduled at ROC ~1.0 (leakage!) and badge_color at ROC ~0.5 (useless). This makes the concept visceral.
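What that report measures can be approximated with a per-feature ROC AUC on synthetic data — a rank-based AUC sketch, not Data Wrangler's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(feature, target):
    """Rank-based AUC: probability a random positive scores above a random negative."""
    order = feature.argsort()
    ranks = np.empty(len(feature))
    ranks[order] = np.arange(1, len(feature) + 1)
    pos = target == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

left = rng.integers(0, 2, 2000)            # target: who left
leaky = left * rng.integers(1, 31, 2000)   # resignation_notice_days: non-zero only for leavers
noise = rng.integers(0, 5, 2000)           # badge_color as random integers

print(round(auc(leaky, left), 2), round(auc(noise, left), 2))  # ~1.0 vs ~0.5
```

The leaky feature separates classes perfectly (AUC ≈ 1.0 means "drop it — it won't exist at prediction time"), while the random one hovers at 0.5 ("drop it — it carries no signal").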
Tasks 4–6 connect SageMaker Studio to an Amazon EMR cluster for large-scale data processing using Apache Spark. This is the “big data” alternative to Data Wrangler’s visual interface.
Best for: Visual exploration, quick transforms, small-medium datasets (<10GB)
Interface: GUI — drag, click, configure
Scale: Single processing job, limited parallelism
Best for: Programmatic transforms, large datasets (10GB–PB), complex joins
Interface: Code — PySpark in Jupyter notebooks
Scale: Distributed across cluster nodes, massive parallelism
Open JupyterLab workspace, clone the lab repository
Switch from Python 3 to SparkMagic — this kernel routes code to the EMR cluster instead of running locally
Use the “Connect to cluster” dialog in Studio. Select the EMR cluster and authenticate (No-Auth for lab simplicity)
Execute Spark SQL queries, compute statistics, join datasets — all distributed across the cluster
| Scenario | Data Size | Why Spark? |
|---|---|---|
| Monthly payroll anomaly detection | ~50M transactions/month across 42M workers | Data Wrangler can’t handle 50M rows. Spark distributes across 20+ nodes. |
| Cross-country compliance joins | Tax rules × employees × jurisdictions = billions of combinations | Complex multi-table joins need distributed SQL engine. |
| Historical attrition feature engineering | 10 years of HR data, 500M records | Window functions (rolling averages, lag features) over massive time series. |
| Real-time fraud feature pipeline | Streaming payroll events at 100K/sec | Spark Structured Streaming for real-time feature computation. |
AnyCompany context: For the 2,000-row attrition dataset, Data Wrangler is perfect. But when AnyCompany scales to their full workforce (millions of employees across 140+ countries), the same transforms need Spark on EMR or SageMaker Processing with Spark containers (Lab 2).
Spark’s equivalent of a pandas DataFrame, but distributed across cluster nodes. Same column/row mental model, but operations run in parallel.
Spark doesn’t execute transforms immediately. It builds a plan (DAG) and only runs when you call an action like .show() or .count(). This allows optimization.
Data is split into chunks (partitions) distributed across nodes. More partitions = more parallelism, but also more overhead. Default for shuffle operations: 200 partitions (spark.sql.shuffle.partitions).
A Jupyter kernel that sends code to a remote Spark cluster via Livy REST API. You write PySpark in a notebook, but it executes on EMR — not your local machine.