Your roadmap through 7 labs covering the full ML lifecycle on AWS — from raw data to production monitoring
Target leakage occurs when a feature in your training data is strongly correlated with the target label but would not be available at prediction time in the real world. It is essentially “cheating” — the model gets information that leaks the answer, scores perfectly in training, but fails in production.
SageMaker Data Wrangler computes the ROC score for each feature column individually via cross-validation. The ROC score tells you how well that single feature alone can predict the target:
| ROC Score | Meaning | Action |
|---|---|---|
| 0.5 | Zero predictive power (coin flip) | Drop — adds noise without helping |
| 0.6 – 0.8 | Normal predictive power | Keep — good features |
| 0.9 – 1.0 | Suspiciously high | Investigate — possible leakage! |
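Under the hood, this per-feature screen is just a single-column AUC: how often a randomly chosen positive example has a higher feature value than a randomly chosen negative one. A self-contained sketch with hypothetical toy data (Data Wrangler additionally cross-validates; this version skips that):

```python
def single_feature_auc(feature, target):
    """AUC of one feature used alone as a predictor: the probability that a
    random positive example has a higher feature value than a random
    negative one (ties count half)."""
    pos = [x for x, y in zip(feature, target) if y == 1]
    neg = [x for x, y in zip(feature, target) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: binary income label plus three candidate features.
target = [1, 1, 1, 0, 0, 0]
leaky = [9, 8, 7, 1, 2, 3]    # separates the classes perfectly -> AUC 1.0
useful = [5, 7, 4, 3, 6, 2]   # mostly separates -> AUC ~0.78
noise = [1, 2, 3, 1, 2, 3]    # identical distribution per class -> AUC 0.5

for name, col in [("leaky", leaky), ("useful", useful), ("noise", noise)]:
    auc = single_feature_auc(col, target)
    flag = "INVESTIGATE" if auc >= 0.9 else ("DROP" if auc <= 0.55 else "keep")
    print(f"{name}: AUC={auc:.2f} -> {flag}")
```

The thresholds (0.9 to investigate, ~0.5 to drop) mirror the table above; they are screening heuristics, not hard rules.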
The adult_data.csv has 14 features predicting income (≤$50K or >$50K). The target leakage report reveals:
| Feature | ROC | Verdict |
|---|---|---|
| education_num | ~0.7 | ✅ Good predictor — keep |
| marital_status | ~0.7 | ✅ Good predictor — keep |
| hours_per_week | ~0.65 | ✅ Moderate — keep |
| fnlwgt | ~0.5 | ❌ Useless — drop |
| native_country | ~0.5 | ❌ Useless — drop |
Result: No target leakage detected (no feature near 1.0). But fnlwgt (census sampling weight) and native_country are dropped because ROC = 0.5 means they contribute nothing — just noise that slows training.
Prevents false confidence: A leaked feature gives 99% accuracy in training but 50% in production. Target leakage analysis catches this before you waste weeks on a model that will fail when deployed.
Removes noise: Features with ROC = 0.5 add computational cost without improving predictions. Dropping them makes training faster and can improve accuracy.
Imagine building an attrition prediction model (will this employee leave?):
| Feature | ROC | Problem |
|---|---|---|
| exit_interview_date | ~1.0 | 🚨 Leakage! If populated, they already left |
| badge_color | ~0.5 | ❌ Useless — no correlation |
| months_since_promotion | ~0.7 | ✅ Genuinely predictive — keep |
ML algorithms expect clean, numeric, consistently formatted data. Raw datasets from production systems contain messy strings, missing values, outliers, and mixed formats. Task 3 applies a transformation pipeline that converts raw census data into model-ready features.
The complete sequence of transforms applied in Data Wrangler, in order:
The income column contains string labels: "<=50K" and ">50K". XGBoost (and most ML algorithms) require numeric targets for binary classification. Data Wrangler’s Search and Edit → Find and Replace converts them:
| Original Value | Replaced With | Meaning |
|---|---|---|
| <=50K | 1 | Positive class (eligible for assistance) |
| >50K | 0 | Negative class (not eligible) |
Why this encoding? The advocacy group wants to find people who earn ≤$50K. Making that the positive class (1) means the model’s recall metric directly measures “what % of eligible people did we correctly identify?” — the metric the business cares about most.
XGBoost requirement: The target column must also be the first column in the CSV (Step 9). XGBoost’s built-in SageMaker container reads column 0 as the label by default — no header row, no column name mapping. If you forget this step, the model trains on the wrong column silently.
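Both steps can be sketched in pandas — a minimal, hypothetical three-row frame standing in for adult_data.csv:

```python
import io
import pandas as pd

# Hypothetical mini-frame standing in for the full adult dataset.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "education": ["Bachelors", "Bachelors", "HS-grad"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# Step 5 equivalent: map the string labels to the business-aligned encoding
# (<=50K is the positive class the advocacy group wants to find).
df["income"] = df["income"].map({"<=50K": 1, ">50K": 0})

# Step 9 equivalent: move the target to column 0 for the built-in XGBoost
# container, then write CSV with no header row.
df = df[["income"] + [c for c in df.columns if c != "income"]]
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
print(buf.getvalue())
```

Note the `header=False`: the built-in container reads raw CSV positionally, so a leftover header row would be parsed as a data row.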
The adult dataset has categorical columns with inconsistent whitespace — values like " Private", "Private ", and " Private " would be treated as three different categories without cleaning.
| Operation | What It Does | Example |
|---|---|---|
| Strip left/right | Remove leading & trailing whitespace | " Private " → "Private" |
| Applied to | education, income, marital_status, occupation, race, relationship, sex, workclass | |
Why it matters: Without stripping, one-hot encoding creates duplicate columns for " Private" vs "Private" — inflating dimensionality and confusing the model. Also, the target encoding (Step 5) would fail: " <=50K" wouldn’t match the pattern "<=50K", leaving string values in the target column and crashing training.
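A quick pandas sketch of the same strip transform, using hypothetical values as they might arrive from a raw export:

```python
import pandas as pd

# Three whitespace variants of "Private" plus one clean value.
df = pd.DataFrame({"workclass": [" Private", "Private ", " Private ", "State-gov"]})
print(df["workclass"].nunique())  # 4 "different" categories before cleaning

# Strip leading and trailing whitespace, as in Data Wrangler's string transform.
df["workclass"] = df["workclass"].str.strip()
print(df["workclass"].nunique())  # collapses to the 2 real categories
```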
| Step | Action | Rationale |
|---|---|---|
| 1. Drop columns | Remove fnlwgt, native_country | ROC = 0.5 (zero predictive power, just noise) |
| 2. Drop missing | Remove rows where occupation or workclass = "?" | ~6% of rows; random pattern, safe to drop |
| 3. Drop duplicates | Remove identical rows | Prevents model from memorizing repeated examples |
The capital_gain column has extreme values (e.g., 99,999) that represent rare stock sales. These outliers skew gradient calculations during training. Data Wrangler removes values > 80,000 or < 0 using Min-Max numeric outliers with the Remove fix method.
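The Min-Max/Remove behavior amounts to a range filter. A sketch with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({"capital_gain": [0, 5000, 99999, 14000, -20, 99999]})

# Min-Max numeric outliers with the Remove fix: keep only rows whose value
# falls inside [0, 80000]; everything outside is dropped, not clipped.
lower, upper = 0, 80_000
cleaned = df[df["capital_gain"].between(lower, upper)]
print(len(df), "->", len(cleaned))
```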
| Encoding | When to Use | Lab 1 Columns |
|---|---|---|
| Ordinal | Natural order exists | education (HS < Bachelors < Masters < PhD), education_num, occupation |
| One-Hot | No inherent order | marital_status, race, relationship, sex, workclass |
Trap: One-hot encoding native_country (41 unique values) would create 41 sparse columns. Since ROC = 0.5 (useless), we dropped it in Step 1 before encoding — saving 41 dimensions and speeding up training.
| Step | Action | Why |
|---|---|---|
| 9. Move target | Move income to column 0 | XGBoost reads first column as label (no header) |
| 10. Split | 70% train / 20% test / 10% validation | Separate data for learning, tuning, and final eval |
| 11. Export | 3 CSVs to S3 (train/, test/, validation/) | Ready for SageMaker Training Job in Lab 3 |
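The randomized 70/20/10 split can be sketched in plain Python (row indices stand in for the prepared dataset; the seed is arbitrary but fixed for reproducibility):

```python
import random

rows = list(range(1000))   # stand-in for the prepared dataset's rows
random.seed(42)            # fixed seed so the split is reproducible
random.shuffle(rows)

n = len(rows)
train = rows[: int(0.7 * n)]              # 70% — model learning
test = rows[int(0.7 * n): int(0.9 * n)]   # 20% — held-out evaluation
validation = rows[int(0.9 * n):]          # 10% — tuning during training

print(len(train), len(test), len(validation))  # 700 200 100
```

Shuffling before slicing matters: slicing a sorted file would put systematically different rows in each split.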
Imagine preparing an AnyCompany attrition prediction dataset:
| Lab 1 Step | HCM Equivalent |
|---|---|
| Strip whitespace | " Hyderabad " → "Hyderabad" (location from HRIS export) |
| Encode target | "Left" → 1, "Stayed" → 0 (binary attrition label) |
| Handle missing | New hires with blank manager_id → impute with “Unassigned” |
| Remove outliers | One-time $500K bonus flagged as outlier in monthly salary column |
| Ordinal encode | job_level: IC1 < IC2 < IC3 < M1 < M2 < VP |
| One-hot encode | department: Engineering, Payroll, HR, Sales (no order) |
| Move target to col 0 | XGBoost expects left_company as first column |
Real-world impact: AnyCompany processes payroll for millions of workers across multiple countries. A single inconsistent string format (e.g., "India" vs " India" vs "IN") can silently split one country into three categories, causing the model to underperform on that segment. The strip transform fixes the whitespace variants; an abbreviation mismatch like "IN" still needs an explicit find-and-replace.
ML algorithms do math — they multiply, add, and compare numbers. They cannot process strings like "Bachelors" or "Private". Every categorical column must be converted to numbers before training. The question is how you convert them, because the wrong method introduces false relationships.
| Method | Output | When to Use | Danger If Misused |
|---|---|---|---|
| Ordinal | Single column: 0, 1, 2, 3… | Categories have a meaningful rank | Model thinks “3 is better than 1” — wrong if no real order exists |
| One-Hot | N columns, each 0 or 1 | Categories are equal, no ranking | Explodes dimensionality (41 countries = 41 columns) |
education Gets Ordinal Encoding

Education levels have a natural hierarchy — a Doctorate represents more education than a high school diploma. The model should know that Doctorate > Masters > Bachelors > HS-grad.
| Original String | Ordinal Value | Interpretation |
|---|---|---|
| Preschool | 0 | Lowest education |
| HS-grad | 4 | High school |
| Some-college | 5 | Partial college |
| Bachelors | 7 | Undergraduate degree |
| Masters | 8 | Graduate degree |
| Doctorate | 9 | Highest education |
Result: The model can now learn “higher education number → higher income probability” as a smooth relationship, which is exactly how income works in reality.
education_num Gets Ordinal Encoding

This column is already numeric (values 1–16), but Data Wrangler’s ordinal encoder resets it to start at 0. This is a normalization step — some algorithms perform better when features start at 0 rather than an arbitrary number. It also ensures the encoding is consistent with the education column.
occupation Gets Ordinal Encoding

This is the debatable choice in the lab. Occupations like "Exec-managerial", "Prof-specialty", "Handlers-cleaners" don’t have an obvious rank. However:
| Reason | Explanation |
|---|---|
| Dimensionality | 14 unique occupations → one-hot creates 14 columns. Ordinal keeps it as 1 column. |
| Income correlation | Occupations do have an implicit income hierarchy (Exec > Sales > Handlers). The ordinal encoder assigns values that XGBoost can split on. |
| Tree-based models | XGBoost uses threshold splits (e.g., “if occupation ≤ 5”). It can learn to group occupations by splitting at different thresholds, even if the ordering isn’t perfect. |
Key insight: For tree-based models (XGBoost, Random Forest), ordinal encoding works even when the order isn’t perfect — the tree just learns multiple splits. For linear models (Logistic Regression), this would be wrong because it assumes a linear relationship between the integer and the target.
sex, race, workclass, marital_status, relationship Get One-Hot Encoding

These columns have no meaningful order. If you assigned integers (Male=0, Female=1), the model would interpret Female as “greater than” Male — which is meaningless. One-hot encoding creates separate binary columns so the model treats each category independently:
| Lab 1 Column | Values | Why One-Hot (not Ordinal) |
|---|---|---|
| marital_status | Married, Divorced, Never-married, Separated, Widowed | Life states, not levels — “Divorced” isn’t “more” than “Married” |
| race | White, Black, Asian-Pac-Islander, Amer-Indian, Other | No ranking exists — ordinal would imply one race is “higher” |
| relationship | Husband, Wife, Own-child, Not-in-family, Unmarried | Family roles aren’t ordered — “Husband” isn’t “more” than “Own-child” |
| sex | Male, Female | Binary but no rank — Male=0/Female=1 implies Female > Male |
| workclass | Private, Self-emp, Federal-gov, Local-gov, State-gov | Employment types aren’t ranked — government isn’t “more” than private |
What one-hot actually does: Each unique value becomes its own column with 0 or 1. If workclass has 5 values, it becomes 5 new columns: workclass_Private, workclass_Self-emp, workclass_Federal-gov, etc. Only one column is “1” per row — the rest are “0”.
| ❌ Before Encoding (strings — model can’t use) | ||
|---|---|---|
| education | sex | workclass |
| Bachelors | Male | Private |
| HS-grad | Female | Federal-gov |
| Masters | Male | Self-emp |
| ✅ After Encoding (numbers — model can process) | ||||
|---|---|---|---|---|
| education (ordinal) | sex_Male (one-hot) | sex_Female (one-hot) | workclass_Private (one-hot) | workclass_Federal-gov (one-hot) |
| 7 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 8 | 1 | 0 | 0 | 0 |
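The before/after tables can be reproduced with a short pandas sketch (hypothetical three-row frame; the ordinal values mirror the education table above):

```python
import pandas as pd

raw = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters"],
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "Federal-gov", "Self-emp"],
})

# Ordinal: an explicit rank map — order is the whole point of this encoding.
edu_rank = {"HS-grad": 4, "Bachelors": 7, "Masters": 8}
encoded = raw.copy()
encoded["education"] = encoded["education"].map(edu_rank)

# One-hot: each unique value becomes its own 0/1 column, exactly one "1" per row.
encoded = pd.get_dummies(encoded, columns=["sex", "workclass"], dtype=int)
print(encoded)
```

`get_dummies` names the new columns `sex_Male`, `workclass_Private`, and so on — the same shape as the “After Encoding” table.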
anycompany_attrition_data.csv

Our synthetic HCM dataset has 21 columns for 2,000 employees. Here’s how each categorical column maps to the same encoding decisions from Lab 1:
| AnyCompany Column | Unique Values | Encoding | Lab 1 Equivalent | Rationale |
|---|---|---|---|---|
| department | Engineering, Finance, HR, Legal, Marketing, Operations, Product, Sales (8) | One-Hot | workclass | No ranking — Engineering isn’t “more” than HR. Creates 8 binary columns. |
| location | Hyderabad, Pune, Chennai, Bengaluru, London, Roseland NJ, Singapore (7) | One-Hot | race / relationship | No ranking — Hyderabad isn’t “higher” than Pune. Creates 7 binary columns. |
| work_mode | Office, Hybrid, Remote (3) | One-Hot | sex | No ranking — Remote isn’t “more” than Office. Creates 3 binary columns. |
| badge_color | Blue, Green, Red, White, Yellow (5) | DROP | fnlwgt | Completely random (ROC ~0.5) — no predictive value. Don’t encode, just drop. |
| level | IC1, IC2, IC3, IC4, IC5, M1, M2, M3 (8) | Ordinal | education | Clear hierarchy: IC1 < IC2 < IC3 < IC4 < IC5 < M1 < M2 < M3 |
| education | High School, Bachelors, Masters, PhD (4) | Ordinal | education | Same as Lab 1! HS < Bachelors < Masters < PhD |
| exit_interview_scheduled | No, Pending, Yes (3) | DROP | — | 🚨 TARGET LEAKAGE! Only “Yes/Pending” for leavers. Would give ROC ~1.0. |
| employee_id | EMP-00001 to EMP-02000 (2000) | DROP | — | Identifier, not a feature. One-hot = 2000 columns (overfitting disaster). |
| ❌ Before (raw CSV — 3 sample rows) | ||||
|---|---|---|---|---|
| department | level | location | work_mode | education |
| HR | IC2 | Hyderabad | Office | Bachelors |
| Engineering | IC4 | London | Remote | Masters |
| Sales | M1 | Pune | Hybrid | PhD |
| ✅ After Encoding (model-ready) | ||||||||
|---|---|---|---|---|---|---|---|---|
| dept_HR | dept_Eng | dept_Sales | level (ordinal) | loc_Hyd | loc_Lon | loc_Pune | wm_Office | education (ordinal) |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 3 | 0 | 1 | 0 | 0 | 2 |
| 0 | 0 | 1 | 5 | 0 | 0 | 1 | 0 | 3 |
Column explosion: The AnyCompany dataset goes from 21 columns to ~30+ after one-hot encoding. department alone creates 8 new columns. This is why we drop useless columns first (badge_color, employee_id, exit_interview_scheduled) — to avoid encoding columns that add no value.
XGBoost builds decision trees by asking yes/no questions about features:
| Encoding | How XGBoost Uses It | Example Split |
|---|---|---|
| Ordinal (level) | Threshold split: “Is level ≤ 4?” (separates ICs from Managers) | IC1-IC5 go left, M1-M3 go right → managers leave less |
| One-Hot (dept_Engineering) | Binary split: “Is dept_Engineering = 1?” | Engineers go left, everyone else goes right → engineers leave more |
The decision rule: Ask yourself “Is category A more than category B?”
✅ Yes (IC3 > IC1, Masters > Bachelors, M2 > M1) → Ordinal
❌ No (Engineering ≠ “more than” HR, Hyderabad ≠ “higher than” Pune) → One-Hot
| Mistake | What Happens | Fix |
|---|---|---|
| Ordinal encode department | Model thinks Engineering(0) < HR(1) < Sales(2) — learns false “Sales is highest” | Use one-hot instead |
| One-hot encode job_level | Model loses the hierarchy — treats VP and IC1 as equally different from IC2 | Use ordinal instead |
| One-hot encode high-cardinality column (e.g., employee_id) | Creates 10,000+ sparse columns, model overfits to individual employees | Drop the column entirely |
In Tasks 1–3, you used SageMaker Data Wrangler — a visual, low-code tool for exploring and transforming data. It works great for prototyping on small datasets. But in production, you need to process millions of records programmatically, with version-controlled code, on a scalable cluster. That’s where Amazon EMR comes in.
The lab teaches both approaches: Data Wrangler for quick visual exploration (Tasks 1–3) and Spark on EMR for scalable, code-first processing (Tasks 5–6). In practice, you prototype in Data Wrangler, then implement the production pipeline in Spark.
| Layer | Technology | Role | Analogy |
|---|---|---|---|
| Top | Apache Spark (PySpark) | Processing engine — distributes work across machines, keeps data in memory | The workers who do the actual computation |
| Middle | Apache Hive | SQL catalog — knows where tables live and their schema | The filing system that says “adult_data is in drawer 3” |
| Bottom | Hadoop (YARN + HDFS/S3) | Resource manager + storage — decides which machine runs what | The office building and its room assignments |
Amazon EMR is the AWS managed service that runs this entire stack for you — no server setup, no cluster configuration, auto-scaling built in.
Spark’s key innovation: it keeps data in memory between processing steps (up to 100x faster than Hadoop MapReduce, which writes to disk after every step). You write Python (PySpark), and Spark distributes the work across the cluster automatically.
| Concept | What It Means | Lab 1 Example |
|---|---|---|
| DataFrame | Like a Pandas DataFrame, but split across multiple machines | adult_df = sqlContext.sql("select * from adult_data") |
| Transformation | A lazy operation (builds a plan, doesn’t execute yet) | StringIndexer, OneHotEncoder, VectorAssembler |
| Action | Triggers actual computation across the cluster | adult_df.count(), .show(), .toPandas() |
| Pipeline | Chain of transformers applied in sequence (reproducible) | Pipeline(stages=indexers + [encoder, assembler]) |
Hive provides a metadata layer so you can query files using SQL. Instead of writing code to parse CSV files from S3, you register a table once and query it with familiar SQL:
| Without Hive | With Hive |
|---|---|
| spark.read.csv("s3://bucket/path/adult.csv", header=True, schema=...) | sqlContext.sql("select * from adult_data") |
In Lab 1, the EMR cluster has the adult dataset pre-registered as a Hive table. That’s why show tables returns adult_data.
Task 6 performs the same feature engineering you did visually in Data Wrangler (Task 3), but now as reproducible code:
| Step | Spark Code | Data Wrangler Equivalent |
|---|---|---|
| Load data | sqlContext.sql("select * from adult_data") | Import CSV into Data Wrangler |
| Explore shape | adult_df.count() → 1000 rows | Data Quality & Insights Report |
| Encode categoricals | StringIndexer + OneHotEncoder | Ordinal / One-Hot encode transforms |
| Assemble features | VectorAssembler → single feature vector | Export step (auto-combines columns) |
| Encode target | StringIndexer(inputCol='income') | Search & Replace (≤50K→1, >50K→0) |
| Approach | Best For | Limitation |
|---|---|---|
| Data Wrangler (visual) | Quick EDA, prototyping transforms, target leakage checks | Doesn’t scale to millions of rows; not version-controlled |
| PySpark on EMR (code) | Production pipelines, large datasets, CI/CD integration | Requires coding; slower to iterate during exploration |
Why this matters at scale: Imagine processing payroll records for millions of workers across multiple countries. Data Wrangler can handle a 10,000-row sample for exploration. But the production feature engineering pipeline — joining employee records with tax tables, encoding job levels, computing tenure features — needs Spark on EMR (or AWS Glue, which is serverless Spark) to process the full dataset in minutes rather than hours.
You don’t have to manually rewrite your Data Wrangler transforms as Spark code. Data Wrangler has built-in export options that generate code from your visual flow:
| Export Target | What You Get | When to Use |
|---|---|---|
| SageMaker Processing Job | Python/PySpark script that runs your exact transforms on managed infrastructure | Scheduled batch processing (e.g., nightly payroll feature refresh) |
| SageMaker Pipeline | Your flow as a step in an MLOps pipeline | End-to-end automation (data prep → train → deploy) |
| Python Notebook | Editable .ipynb with all transforms as code | Customize logic, add error handling, integrate with EMR/Glue |
| Feature Store | Pushes transformed features directly to SageMaker Feature Store | Reusable features shared across multiple models |
Real workflow: Prototype transforms visually in Data Wrangler → Export as Python/PySpark code → Customize (add error handling, parameterize S3 paths) → Deploy as a scheduled Spark job on EMR or Glue. You never rewrite from scratch.
Note: Lab 1 doesn’t demonstrate this export path — it has you do both manually to teach the underlying concepts. In a real project, you’d use Data Wrangler’s export as your starting point.
You don’t need to become a Spark expert for this course. The important concept is: Data Wrangler = prototype, Spark = production. The lab shows you both so you understand the full workflow from exploration to scalable implementation. And in practice, Data Wrangler bridges the gap by exporting your visual transforms as production-ready code.
- Visual data wrangling, Spark processing at scale, feature engineering with encoding techniques
- XGBoost training jobs, algorithm configuration, model artifact management
- Automatic hyperparameter optimization, parallel training runs, model selection
- Blue/Green deployment, linear traffic shifting, CloudWatch alarm guardrails
- SageMaker Pipelines workflow, Model Registry versioning, automated ML workflows
- Data drift detection, quality baselines, automated retraining triggers via Step Functions