Lab 3 — Interactive Explainer

Training a Model with SageMaker

Configure an XGBoost estimator, set hyperparameters, launch a managed training job, and evaluate model performance using the Debugger XGBoost report.

🎯 XGBoost ⚙️ Hyperparameters 📊 Model Evaluation 🏢 HCM Context 🧪 Lab 3

📋 Lab 3 Overview

In this lab you take the processed dataset from Labs 1 & 2 and train a binary classification model using Amazon SageMaker's built-in XGBoost algorithm. You configure an Estimator, set hyperparameters, launch a managed training job on dedicated ML instances, and evaluate the results through the auto-generated Debugger XGBoost report.

Duration: ~25 minutes • Phase: Model Training • Prerequisite: Labs 1–2 (processed CSV in S3)

What You Build

Input

📁 S3 Data

Train CSV (70%)
Validation CSV (10%)
Test CSV (20%) — held for Lab 5

Process

⚙️ Training Job

XGBoost 1.5-1
ml.m5.xlarge • 1000 rounds

Output

📦 Model Artifact

model.tar.gz in S3
+ XGBoost Debugger Report

Why 3 splits? Train (70%) = model learns patterns. Validation (10%) = monitors overfitting during training (logloss printed each round). Test (20%) = final evaluation after deployment (Lab 5) on data the model has never seen.

How Validation Works During Training

The validation set is not used to teach the model — it's a "practice exam" the model takes after each boosting round to check if it's actually learning generalizable patterns or just memorizing the training data.

Step | What Happens | Analogy
1. Train | XGBoost builds a new tree from the train data (70%) and adds it to the ensemble. | Student studies a chapter
2. Validate | XGBoost predicts on the validation data (10%) without learning from it and computes logloss. | Student takes a practice quiz (no peeking at answers)
3. Log | Both metrics are printed: [50] train-logloss:0.309 val-logloss:0.317 | Teacher records both scores
4. Repeat | Steps 1–3 repeat for all 1000 rounds (num_round) | 1000 study sessions with practice quizzes
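To see this loop outside of SageMaker, here is a minimal local sketch (assuming the xgboost library is installed and the lab's CSV layout with the target in column 0, file names are placeholders): the evals list is what produces the paired train/val logloss line after every round.

```python
# Minimal local sketch (not the SageMaker job) showing how the validation set is
# monitored each boosting round without being used for learning.
import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv", header=None)        # placeholder paths
val = pd.read_csv("validation.csv", header=None)

dtrain = xgb.DMatrix(train.iloc[:, 1:], label=train.iloc[:, 0])
dval = xgb.DMatrix(val.iloc[:, 1:], label=val.iloc[:, 0])

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 5, "eta": 0.1}

# evals prints train-logloss and val-logloss after every round, mirroring the
# [50] train-logloss:0.309 val-logloss:0.317 lines in the training job logs.
booster = xgb.train(params, dtrain, num_boost_round=1000,
                    evals=[(dtrain, "train"), (dval, "val")])
```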

✅ Learning Well

Train ↓ & Val ↓
Both improving together

⚠️ Overfitting

Train ↓ but Val → or ↑
Memorizing, not learning

❌ Underfitting

Both high & flat
Model too simple

In Lab 3: We saw overfitting start at ~round 250 (val-logloss plateaued at 0.300 while train kept dropping to 0.271). Lab 4 fixes this with early stopping — automatically halting training when validation stops improving.

The test set (20%) is the "final exam" — used only once after training is completely done (Lab 5) to get an honest, unbiased score. If you used it during training, you'd be "teaching to the test" and your real-world performance estimate would be too optimistic.

Key Concepts Covered

📦

SageMaker Estimator

High-level Python SDK interface that wraps container image, instance type, IAM role, and output path into a single object you call .fit() on.

🌳

XGBoost Algorithm

Gradient boosted trees — an ensemble of weak decision trees trained sequentially, each correcting the errors of the previous one. Ideal for tabular classification.

⚙️

Hyperparameters

Knobs that control training behavior (tree depth, learning rate, regularization). Set before training — not learned from data. Tuned in Lab 4.

📊

Debugger Report

Auto-generated notebook with loss curves, confusion matrix, feature importance, and classification metrics — no extra code needed.

💡
Why this matters at AnyCompany: The same pattern — Estimator + XGBoost + Debugger — is how you'd train an attrition prediction model on millions of employee records. The only differences: your features come from HRIS data (tenure, engagement scores, promotion history) and your target is left_company instead of income.

🔄 SageMaker Training Architecture

A SageMaker training job orchestrates multiple AWS services behind the scenes. Your notebook code simply calls .fit() — SageMaker handles provisioning compute, pulling the container image, downloading data from S3, running the algorithm, and uploading the model artifact back to S3.
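Below is a minimal sketch of that pattern, assuming placeholder bucket names and S3 keys (your lab notebook supplies the real values). It wires together the pieces shown in the diagram: the XGBoost 1.5-1 container from ECR, the ml.m5.xlarge instance, the S3 output path, and the optional CreateXgboostReport Debugger rule.

```python
# A minimal sketch, not the lab's exact notebook code. Bucket name and S3 keys
# are placeholders; the full hyperparameter set is covered in a later section.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import Rule, rule_configs

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-lab-bucket"  # placeholder

# Resolve the built-in XGBoost 1.5-1 image for the current region
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.5-1"
)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/scripts/data/output/",
    sagemaker_session=session,
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],  # optional Debugger report
)

# Hyperparameters (see the Hyperparameters section for all 7 values used in the lab)
estimator.set_hyperparameters(objective="binary:logistic", num_round=1000)

# .fit() provisions the instance, pulls the container, downloads the CSVs from S3,
# runs training, and uploads model.tar.gz to the output path.
estimator.fit({
    "train": TrainingInput(f"s3://{bucket}/data/train.csv", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/data/validation.csv", content_type="text/csv"),
})
```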


📦 S3 Training Data: The processed CSV files from Labs 1–2 (train 70%, validation 10%) stored in your lab S3 bucket. XGBoost reads CSV with the target column first (column 0 = income label).
[Diagram] Input: 📁 S3 Training Data (train.csv + val.csv) → Configuration: 📦 ECR Container (XGBoost 1.5-1) + ⚙️ Hyperparameters (7 config values) → Compute: 🖥️ ML Instance (ml.m5.xlarge) → Output: 🎯 Model Artifact (model.tar.gz)
Training Job Configuration
Instance Type: ml.m5.xlarge (4 vCPU, 16 GB)
Instance Count: 1
Algorithm: XGBoost 1.5-1
Input Format: text/csv
Output: s3://bucket/scripts/data/output/
Debugger Rule: CreateXgboostReport (every 5 steps)
🔍
How Debugger works: The CreateXgboostReport rule launches a separate processing job (on ml.t3.medium) that reads tensor snapshots saved every 5 rounds. It uses the sagemaker-debugger-rules:latest container from ECR to generate the report notebook. The training job succeeds or fails independently of this rule; the Debugger report is an optional add-on for visualization.

🌳 XGBoost Deep Dive

XGBoost (eXtreme Gradient Boosting) is a supervised learning algorithm that builds an ensemble of decision trees sequentially. Each new tree focuses on correcting the mistakes of the previous ensemble, gradually reducing prediction error through gradient descent in function space.

How Gradient Boosting Works

🌳 Step 1 — Initial Prediction: Start with a simple baseline (e.g., average probability). For binary classification, this is the log-odds of the positive class in the training data.
[Diagram] 🏁 Baseline (initial guess) → 🌳 Tree 1 (fix biggest errors) → 🌳 Tree 2 (fix residuals) → 🌳 ... (1000 num_round iterations) → 🎯 Ensemble (sum of all trees)
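The loop below is a toy from-scratch illustration of this additive process (not XGBoost's actual implementation, which uses gradients and second derivatives of the logistic loss plus regularization): a baseline guess, then a sequence of small trees each fitted to what the current ensemble still gets wrong.

```python
# Toy illustration of the boosting loop above, on synthetic data with a
# squared-error residual for brevity. All names and values are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # synthetic binary target

eta, n_rounds = 0.1, 50
prediction = np.full_like(y, y.mean())            # Step 1: baseline guess
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                    # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    prediction += eta * tree.predict(X)           # each tree nudges the prediction

print("final mean absolute error:", np.abs(y - prediction).mean())
```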

Why XGBoost for This Problem?

Strength | How It Helps | HCM Relevance
Handles mixed features | Works with both ordinal-encoded and one-hot-encoded columns without normalization | Employee data mixes numeric (tenure, salary) with categorical (department, level)
Built-in regularization | gamma, min_child_weight, and subsample prevent overfitting on noisy data | HR data is noisy: engagement surveys have measurement error
Feature importance | Automatically ranks which features drive predictions | Tells you whether tenure or manager changes matter more for attrition
Fast on CPU | Trains in ~2 minutes on ml.m5.xlarge for 22K rows × 34 features | Most HCM datasets are tabular and fit comfortably on CPU instances
Binary classification | binary:logistic objective outputs a probability in [0,1] | Attrition is binary: left_company = 1 or 0
📝
Note: XGBoost is the go-to algorithm for structured/tabular data competitions and enterprise ML. For AnyCompany's attrition model, it would likely outperform logistic regression while remaining interpretable through feature importance and SHAP values. Deep learning (neural nets) is overkill for tabular HCM data with <100 features.

⚙️ Hyperparameters Explained

Hyperparameters are configuration values set before training begins — they control how the algorithm learns, not what it learns. The lab uses 7 key hyperparameters for XGBoost. Getting these right is the difference between a model that generalizes well and one that memorizes training noise.

Lab 3 Hyperparameter Configuration

Parameter | Value | What It Controls | AnyCompany Analogy
max_depth | 5 | Maximum depth of each decision tree. Deeper = more complex patterns but higher overfitting risk. | Like limiting how many "if-then" rules you chain: "IF tenure < 2 AND engagement < 3 AND no promotion AND remote AND ..."
eta | 0.1 | Learning rate: how much each new tree contributes. Lower = slower but more precise convergence. | Taking small careful steps when adjusting attrition risk scores rather than large jumps
gamma | 4 | Minimum loss reduction required to make a split. Higher = more conservative tree growth (regularization). | Only split a node if it meaningfully separates leavers from stayers; ignore trivial differences
min_child_weight | 6 | Minimum sum of instance weights in a leaf. Prevents rules based on too few examples. | Don't create a rule from a handful of employees; require enough evidence to justify a pattern
subsample | 0.7 | Fraction of training data sampled per tree. Adds randomness to prevent overfitting (like bagging). | Each tree sees 70% of employees, a different random subset, so no single outlier dominates
objective | binary:logistic | Loss function for binary classification. Outputs a probability between 0 and 1. | Predicting probability of attrition: 0.82 means 82% likely to leave within 12 months
num_round | 1000 | Number of boosting iterations (trees to build). More rounds with a low eta give finer convergence. | Building 1000 small correction trees, each slightly improving the attrition prediction
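The table above maps directly onto a single SDK call. A sketch, assuming the estimator object from the training-architecture section; the lab notebook may pass the same values at construction time instead.

```python
# The 7 values from the table, applied to the estimator from the earlier sketch.
estimator.set_hyperparameters(
    max_depth=5,                  # cap tree complexity
    eta=0.1,                      # small learning rate per tree
    gamma=4,                      # minimum loss reduction to split a node
    min_child_weight=6,           # minimum evidence required in a leaf
    subsample=0.7,                # each tree samples 70% of rows
    objective="binary:logistic",  # output a probability between 0 and 1
    num_round=1000,               # number of boosting rounds (trees)
)
```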

Hyperparameter Interaction Map

🛡️

Regularization Trio

gamma + min_child_weight + max_depth work together to prevent overfitting. High gamma prunes weak splits, min_child_weight requires sufficient evidence, and max_depth caps complexity.

🏎️

Speed vs Precision

eta × num_round = total learning capacity. Low eta (0.1) with high rounds (1000) gives fine-grained convergence. High eta (0.3) with fewer rounds trains faster but may overshoot.

🎲

Randomization

subsample = 0.7 means each tree only sees 70% of rows. This stochastic element reduces variance and makes the ensemble more robust to noisy HR survey data.

⚠️
Lab 4 Preview: These hyperparameters were chosen manually in Lab 3. In Lab 4, you'll use SageMaker Automatic Model Tuning (Bayesian optimization) to search for optimal values automatically — testing ranges like max_depth: [3,10] and eta: [0.01, 0.3] across multiple training jobs in parallel.

📊 Model Evaluation

After training completes, SageMaker Debugger automatically generates an XGBoost report containing loss curves, a confusion matrix, classification metrics, and feature importance rankings. This section breaks down how to interpret each component.

⚠️
Known Issue — Debugger Report May Fail: The CreateXgboostReport rule runs as a separate processing job that can fail with "AlgorithmError: Algorithm container exited with error." This is a platform issue (the debugger rules container image is incompatible with newer environments) — not something you did wrong. If this happens, skip the waiter cell and proceed to Task 2.6. The training itself completed successfully — you can read the metrics directly from the training logs printed above (train-logloss and validation-logloss at each round).

Reading the Training Logs Directly

Even without the Debugger report, the training output shows logloss at every round. Here's what a successful run looks like:

Actual Training Metrics (from our test run)
Round 0 (start): train 0.548 | val 0.550
Round 50: train 0.309 | val 0.317
Round 100: train 0.294 | val 0.304
Round 250: train 0.284 | val 0.300
Round 500: train 0.278 | val 0.301
Round 999 (final): train 0.271 | val 0.302
Training time: 130 seconds (billable)
Overfitting signal: val loss plateaus at ~round 250, then drifts up

How to read this: Lower logloss = better predictions. The gap between train and validation widens after round 250 — the model starts memorizing training noise rather than learning generalizable patterns. Lab 4's hyperparameter tuning (with early stopping) would catch this and stop training earlier.
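If you prefer to pull these numbers programmatically rather than scrolling the console logs, the SDK's TrainingJobAnalytics helper reads the metrics SageMaker publishes to CloudWatch. A sketch, assuming the estimator from the earlier example; the exact metric names follow the channel names, so confirm them in the training job's metric definitions.

```python
# Hedged sketch: pull the per-round logloss curves for a finished training job.
from sagemaker.analytics import TrainingJobAnalytics

job_name = estimator.latest_training_job.name  # or paste the job name from the console

metrics = TrainingJobAnalytics(
    training_job_name=job_name,
    metric_names=["train:logloss", "validation:logloss"],  # names follow the channel names
).dataframe()

# Compare the two curves side by side to spot where validation stops improving
curves = metrics.pivot_table(index="timestamp", columns="metric_name", values="value")
print(curves.tail())
```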

Confusion Matrix — The Foundation

For binary classification (income ≤$50K = 1, >$50K = 0), the confusion matrix shows four outcomes:

True Positive (TP)

Predicted ≤$50K
Actually ≤$50K
Correct identification

False Positive (FP)

Predicted ≤$50K
Actually >$50K
Wasted outreach

⚠️

False Negative (FN)

Predicted >$50K
Actually ≤$50K
Missed person in need

True Negative (TN)

Predicted >$50K
Actually >$50K
Correct exclusion

Classification Metrics

Metric | Formula | What It Measures | HCM Interpretation
Precision | TP / (TP + FP) | Of all predicted positives, how many are actually positive? | "Of employees we flagged as flight risks, how many actually left?" High precision = fewer false alarms.
Recall | TP / (TP + FN) | Of all actual positives, how many did we catch? | "Of all employees who left, how many did we identify beforehand?" High recall = fewer missed departures.
F1-Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall; balances both. | The target metric for Lab 3. Balances "don't cry wolf" (precision) with "don't miss anyone" (recall).
Accuracy | (TP + TN) / Total | Overall fraction of correct predictions. | Can be misleading with imbalanced classes: if 75% stay, always predicting "stay" gives 75% accuracy.
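If the Debugger report is unavailable, the same metrics take a few lines to compute once you have predictions (as you will in Lab 5). A sketch with placeholder labels and predictions, using scikit-learn.

```python
# Minimal sketch of the four metrics from the table above. y_true / y_pred are
# placeholders for test labels and 0/1 predictions (probabilities thresholded at 0.5).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (placeholder)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (placeholder)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # 2PR / (P + R)
print("accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / total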

Feature Importance — What Drives Predictions?

The XGBoost report ranks features by how often they're used in tree splits. In the lab dataset (34 features after encoding), the top predictors typically include:

💰

capital_gain (f4)

Strong income signal — investment income correlates heavily with earning >$50K

👨‍👩‍👧

marital_status (f7)

Married-civ-spouse is a top predictor — household economics affect income bracket

🎓

education_num (f2)

Years of education directly correlates with earning potential

hours_per_week (f6)

Working hours signal full-time vs part-time employment patterns
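If the Debugger report fails (see the known issue above), you can still inspect the ranking behind the feature cards yourself by downloading the model artifact and asking the booster for its split counts. A hedged sketch, assuming the estimator from the earlier example; the file name inside model.tar.gz ("xgboost-model") is typical for the built-in algorithm but may differ by container version.

```python
# Hedged sketch: download model.tar.gz, load the booster, print top features.
import tarfile
import xgboost as xgb
from sagemaker.s3 import S3Downloader

S3Downloader.download(estimator.model_data, ".")   # fetch model.tar.gz locally
with tarfile.open("model.tar.gz") as tar:
    tar.extractall()

booster = xgb.Booster()
booster.load_model("xgboost-model")                # typical file name; may vary by version

# "weight" counts how often each feature (f0, f1, ...) is used in a split
importance = booster.get_score(importance_type="weight")
for feature, score in sorted(importance.items(), key=lambda kv: -kv[1])[:5]:
    print(feature, score)
```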

💡
AnyCompany parallel: For an attrition model, you'd expect top features to be months_since_promotion, engagement_score, manager_changes_2yr, and salary_percentile. If exit_interview_scheduled ranks #1, that's target leakage — the feature only exists because the person already decided to leave (covered in Lab 1).

🏢 HCM Mapping — AnyCompany Attrition Model

Every step in Lab 3 maps directly to training an employee attrition prediction model at AnyCompany. Here's how you'd apply the same pattern to predict which employees are likely to leave within the next 12 months.

Lab 3 → AnyCompany Translation

Lab 3 (Census Income) | AnyCompany (Attrition) | Key Difference
Target: income (≤$50K = 1) | Target: left_company (left = 1) | Same binary classification pattern
34 features (age, education, occupation...) | 19 features (tenure, engagement, salary...) | AnyCompany has fewer but more domain-specific features
~22K training rows | ~1.4K training rows (from 2,000 synthetic) | A real AnyCompany dataset would have 100K+ rows from HRIS
70/10/20 split (train/val/test) | Same split ratios | Time-based split preferred for attrition (train on past, test on recent)
XGBoost binary:logistic | XGBoost binary:logistic | Same algorithm; XGBoost excels on tabular HR data
F1-Score as target metric | Recall prioritized (don't miss leavers) | Missing a flight risk is costlier than a false alarm at AnyCompany
ml.m5.xlarge (single instance) | ml.m5.xlarge sufficient for <1M rows | Scale to ml.m5.4xlarge for the full AnyCompany dataset

AnyCompany Training Configuration

Estimator Config — Attrition Model
Algorithm: XGBoost 1.5-1
Objective: binary:logistic
Target Column: left_company (column 0)
Key Features: engagement_score, months_since_promotion, salary_percentile
Dropped (Leakage): exit_interview_scheduled, resignation_notice_days
Dropped (Useless): badge_color, employee_id
Instance: ml.m5.xlarge (4 vCPU, 16 GB)
Hyperparameters: Same as Lab 3 (tuned in Lab 4)

Expected Feature Importance (AnyCompany)

📈

engagement_score

Strongest predictor. Low engagement (1–3) correlates with 4× higher attrition. Equivalent to capital_gain in the lab.

months_since_promotion

Employees stuck >24 months without promotion show elevated flight risk. Time-based feature with clear business logic.

💵

salary_percentile

Below-market compensation (percentile <30) drives departures. Especially impactful for IC3–IC5 levels.

👥

manager_changes_2yr

Frequent manager changes (≥3 in 2 years) signal organizational instability. Moderate predictor with clear intervention path.

🚀
Production consideration: At AnyCompany scale (millions of employees across clients), you'd add a time-based train/test split — train on data from months 1–18, validate on months 19–21, test on months 22–24. This prevents future information from leaking into training and gives a realistic estimate of how the model performs on truly unseen future data.
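A minimal sketch of that split, assuming a hypothetical snapshot_month column (1–24) in the feature table; the column name, month ranges, and file path are illustrative.

```python
# Hedged sketch of a time-based train/validation/test split for attrition data.
import pandas as pd

df = pd.read_csv("attrition_features.csv")            # placeholder path

train_df = df[df["snapshot_month"] <= 18]              # months 1-18: train on the past
val_df = df[df["snapshot_month"].between(19, 21)]      # months 19-21: validate on more recent data
test_df = df[df["snapshot_month"] >= 22]               # months 22-24: truly "future" holdout

print(len(train_df), len(val_df), len(test_df))
```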