Lab 3 — Interactive Explainer

Training a Model with SageMaker

Configure an XGBoost estimator, set hyperparameters, launch a managed training job, and evaluate model performance using the Debugger XGBoost report.

🎯 XGBoost ⚙️ Hyperparameters 📊 Model Evaluation 🏢 HCM Context 🧪 Lab 3

📋 Lab 3 Overview

In this lab you take the processed dataset from Labs 1 & 2 and train a binary classification model using Amazon SageMaker's built-in XGBoost algorithm. You configure an Estimator, set hyperparameters, launch a managed training job on dedicated ML instances, and evaluate the results through the auto-generated Debugger XGBoost report.

Duration: ~25 minutes • Phase: Model Training • Prerequisite: Labs 1–2 (processed CSV in S3)

What You Build

Input

📁 S3 Data

Train CSV (70%)
Validation CSV (10%)
Test CSV (20%) — held for Lab 5

Process

⚙️ Training Job

XGBoost 1.5-1
ml.m5.xlarge • 1000 rounds

Output

📦 Model Artifact

model.tar.gz in S3
+ XGBoost Debugger Report

Why 3 splits? Train (70%) = model learns patterns. Validation (10%) = monitors overfitting during training (logloss printed each round). Test (20%) = final evaluation after deployment (Lab 5) on data the model has never seen.

How Validation Works During Training

The validation set is not used to teach the model — it's a "practice exam" the model takes after each boosting round to check if it's actually learning generalizable patterns or just memorizing the training data.

Step | What Happens | Analogy
1. Train | XGBoost builds a new tree from the train data (70%) and adds it to the ensemble. | Student studies a chapter
2. Validate | XGBoost predicts on the validation data (10%) without learning from it and computes logloss. | Student takes a practice quiz (no peeking at answers)
3. Log | Both metrics are printed: [50] train-logloss:0.309 val-logloss:0.317 | Teacher records both scores
4. Repeat | Steps 1–3 repeat for all 1000 rounds (num_round) | 1000 study sessions with practice quizzes
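To see this loop outside of SageMaker, here is a minimal local sketch (assuming the xgboost library is installed and the lab's CSV layout with the target in column 0, file names are placeholders): the evals list is what produces the paired train/val logloss line after every round.

```python
# Minimal local sketch (not the SageMaker job) showing how the validation set is
# monitored each boosting round without being used for learning.
import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv", header=None)        # placeholder paths
val = pd.read_csv("validation.csv", header=None)

dtrain = xgb.DMatrix(train.iloc[:, 1:], label=train.iloc[:, 0])
dval = xgb.DMatrix(val.iloc[:, 1:], label=val.iloc[:, 0])

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 5, "eta": 0.1}

# evals prints train-logloss and val-logloss after every round, mirroring the
# [50] train-logloss:0.309 val-logloss:0.317 lines in the training job logs.
booster = xgb.train(params, dtrain, num_boost_round=1000,
                    evals=[(dtrain, "train"), (dval, "val")])
```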

✅ Learning Well

Train ↓ & Val ↓
Both improving together

⚠️ Overfitting

Train ↓ but Val → or ↑
Memorizing, not learning

❌ Underfitting

Both high & flat
Model too simple

In Lab 3: We saw overfitting start at ~round 250 (val-logloss plateaued at 0.300 while train kept dropping to 0.271). Lab 4 fixes this with early stopping — automatically halting training when validation stops improving.

The test set (20%) is the "final exam" — used only once after training is completely done (Lab 5) to get an honest, unbiased score. If you used it during training, you'd be "teaching to the test" and your real-world performance estimate would be too optimistic.

Key Concepts Covered

📦

SageMaker Estimator

High-level Python SDK interface that wraps container image, instance type, IAM role, and output path into a single object you call .fit() on.

🌳

XGBoost Algorithm

Gradient boosted trees — an ensemble of weak decision trees trained sequentially, each correcting the errors of the previous one. Ideal for tabular classification.

⚙️

Hyperparameters

Knobs that control training behavior (tree depth, learning rate, regularization). Set before training — not learned from data. Tuned in Lab 4.

📊

Debugger Report

Auto-generated notebook with loss curves, confusion matrix, feature importance, and classification metrics — no extra code needed.

💡
Why this matters at AnyCompany: The same pattern — Estimator + XGBoost + Debugger — is how you'd train an attrition prediction model on millions of employee records. The only differences: your features come from HRIS data (tenure, engagement scores, promotion history) and your target is left_company instead of income.

🔄 SageMaker Training Architecture

A SageMaker training job orchestrates multiple AWS services behind the scenes. Your notebook code simply calls .fit() — SageMaker handles provisioning compute, pulling the container image, downloading data from S3, running the algorithm, and uploading the model artifact back to S3.
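Below is a minimal sketch of that pattern, assuming placeholder bucket names and S3 keys (your lab notebook supplies the real values). It wires together the pieces shown in the diagram: the XGBoost 1.5-1 container from ECR, the ml.m5.xlarge instance, the S3 output path, and the optional CreateXgboostReport Debugger rule.

```python
# A minimal sketch, not the lab's exact notebook code. Bucket name and S3 keys
# are placeholders; the full hyperparameter set is covered in a later section.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import Rule, rule_configs

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-lab-bucket"  # placeholder

# Resolve the built-in XGBoost 1.5-1 image for the current region
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.5-1"
)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/scripts/data/output/",
    sagemaker_session=session,
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],  # optional Debugger report
)

# Hyperparameters (see the Hyperparameters section for all 7 values used in the lab)
estimator.set_hyperparameters(objective="binary:logistic", num_round=1000)

# .fit() provisions the instance, pulls the container, downloads the CSVs from S3,
# runs training, and uploads model.tar.gz to the output path.
estimator.fit({
    "train": TrainingInput(f"s3://{bucket}/data/train.csv", content_type="text/csv"),
    "validation": TrainingInput(f"s3://{bucket}/data/validation.csv", content_type="text/csv"),
})
```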


📦 S3 Training Data: The processed CSV files from Labs 1–2 (train 70%, validation 10%) stored in your lab S3 bucket. XGBoost reads CSV with the target column first (column 0 = income label).
[Diagram] Input: 📁 S3 Training Data (train.csv + val.csv) → Configuration: 📦 ECR Container (XGBoost 1.5-1) + ⚙️ Hyperparameters (7 config values) → Compute: 🖥️ ML Instance (ml.m5.xlarge) → Output: 🎯 Model Artifact (model.tar.gz)
Training Job Configuration
Instance Type: ml.m5.xlarge (4 vCPU, 16 GB)
Instance Count: 1
Algorithm: XGBoost 1.5-1
Input Format: text/csv
Output: s3://bucket/scripts/data/output/
Debugger Rule: CreateXgboostReport (every 5 steps)
🔍
How Debugger works: The CreateXgboostReport rule launches a separate processing job (on ml.t3.medium) that reads tensor snapshots saved every 5 rounds. It uses the sagemaker-debugger-rules:latest container from ECR to generate the report notebook. The training job succeeds or fails independently of this rule; the Debugger report is an optional add-on for visualization.

🌳 XGBoost Deep Dive

XGBoost (eXtreme Gradient Boosting) is a supervised learning algorithm that builds an ensemble of decision trees sequentially. Each new tree focuses on correcting the mistakes of the previous ensemble, gradually reducing prediction error through gradient descent in function space.

How Gradient Boosting Works

🌳 Step 1 — Initial Prediction: Start with a simple baseline (e.g., average probability). For binary classification, this is the log-odds of the positive class in the training data.
[Diagram] 🏁 Baseline (initial guess) → 🌳 Tree 1 (fix biggest errors) → 🌳 Tree 2 (fix residuals) → 🌳 ... (1000 num_round iterations) → 🎯 Ensemble (sum of all trees)
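The loop below is a toy from-scratch illustration of this additive process (not XGBoost's actual implementation, which uses gradients and second derivatives of the logistic loss plus regularization): a baseline guess, then a sequence of small trees each fitted to what the current ensemble still gets wrong.

```python
# Toy illustration of the boosting loop above, on synthetic data with a
# squared-error residual for brevity. All names and values are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # synthetic binary target

eta, n_rounds = 0.1, 50
prediction = np.full_like(y, y.mean())            # Step 1: baseline guess
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                    # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    prediction += eta * tree.predict(X)           # each tree nudges the prediction

print("final mean absolute error:", np.abs(y - prediction).mean())
```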

Why XGBoost for This Problem?

Strength | How It Helps | HCM Relevance
Handles mixed features | Works with both ordinal-encoded and one-hot-encoded columns without normalization | Employee data mixes numeric (tenure, salary) with categorical (department, level)
Built-in regularization | gamma, min_child_weight, and subsample prevent overfitting on noisy data | HR data is noisy: engagement surveys have measurement error
Feature importance | Automatically ranks which features drive predictions | Tells you whether tenure or manager changes matter more for attrition
Fast on CPU | Trains in ~2 minutes on ml.m5.xlarge for 22K rows × 34 features | Most HCM datasets are tabular and fit comfortably on CPU instances
Binary classification | binary:logistic objective outputs a probability in [0,1] | Attrition is binary: left_company = 1 or 0
📝
Note: XGBoost is the go-to algorithm for structured/tabular data competitions and enterprise ML. For AnyCompany's attrition model, it would likely outperform logistic regression while remaining interpretable through feature importance and SHAP values. Deep learning (neural nets) is overkill for tabular HCM data with <100 features.

⚙️ Hyperparameters Explained

Hyperparameters are configuration values set before training begins — they control how the algorithm learns, not what it learns. The lab uses 7 key hyperparameters for XGBoost. Getting these right is the difference between a model that generalizes well and one that memorizes training noise.

Lab 3 Hyperparameter Configuration

Parameter | Value | What It Controls | AnyCompany Analogy
max_depth | 5 | Maximum depth of each decision tree. Deeper = more complex patterns but higher overfitting risk. | Like limiting how many "if-then" rules you chain: "IF tenure < 2 AND engagement < 3 AND no promotion AND remote AND ..."
eta | 0.1 | Learning rate: how much each new tree contributes. Lower = slower but more precise convergence. | Taking small careful steps when adjusting attrition risk scores rather than large jumps
gamma | 4 | Minimum loss reduction required to make a split. Higher = more conservative tree growth (regularization). | Only split a node if it meaningfully separates leavers from stayers; ignore trivial differences
min_child_weight | 6 | Minimum sum of instance weights in a leaf. Prevents rules based on too few examples. | Don't create a rule from a handful of employees; require enough evidence to justify a pattern
subsample | 0.7 | Fraction of training data sampled per tree. Adds randomness to prevent overfitting (like bagging). | Each tree sees 70% of employees, a different random subset, so no single outlier dominates
objective | binary:logistic | Loss function for binary classification. Outputs a probability between 0 and 1. | Predicting probability of attrition: 0.82 means 82% likely to leave within 12 months
num_round | 1000 | Number of boosting iterations (trees to build). More rounds with a low eta give finer convergence. | Building 1000 small correction trees, each slightly improving the attrition prediction
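The table above maps directly onto a single SDK call. A sketch, assuming the estimator object from the training-architecture section; the lab notebook may pass the same values at construction time instead.

```python
# The 7 values from the table, applied to the estimator from the earlier sketch.
estimator.set_hyperparameters(
    max_depth=5,                  # cap tree complexity
    eta=0.1,                      # small learning rate per tree
    gamma=4,                      # minimum loss reduction to split a node
    min_child_weight=6,           # minimum evidence required in a leaf
    subsample=0.7,                # each tree samples 70% of rows
    objective="binary:logistic",  # output a probability between 0 and 1
    num_round=1000,               # number of boosting rounds (trees)
)
```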

Hyperparameter Interaction Map

🛡️

Regularization Trio

gamma + min_child_weight + max_depth work together to prevent overfitting. High gamma prunes weak splits, min_child_weight requires sufficient evidence, and max_depth caps complexity.

🏎️

Speed vs Precision

eta × num_round = total learning capacity. Low eta (0.1) with high rounds (1000) gives fine-grained convergence. High eta (0.3) with fewer rounds trains faster but may overshoot.

🎲

Randomization

subsample = 0.7 means each tree only sees 70% of rows. This stochastic element reduces variance and makes the ensemble more robust to noisy HR survey data.

⚠️
Lab 4 Preview: These hyperparameters were chosen manually in Lab 3. In Lab 4, you'll use SageMaker Automatic Model Tuning (Bayesian optimization) to search for optimal values automatically — testing ranges like max_depth: [3,10] and eta: [0.01, 0.3] across multiple training jobs in parallel.

📊 Model Evaluation

After training completes, SageMaker Debugger automatically generates an XGBoost report containing loss curves, a confusion matrix, classification metrics, and feature importance rankings. This section breaks down how to interpret each component.

⚠️
Known Issue — Debugger Report May Fail: The CreateXgboostReport rule runs as a separate processing job that can fail with "AlgorithmError: Algorithm container exited with error." This is a platform issue (the debugger rules container image is incompatible with newer environments) — not something you did wrong. If this happens, skip the waiter cell and proceed to Task 2.6. The training itself completed successfully — you can read the metrics directly from the training logs printed above (train-logloss and validation-logloss at each round).

Reading the Training Logs Directly

Even without the Debugger report, the training output shows logloss at every round. Here's what a successful run looks like:

Actual Training Metrics (from our test run)
Round 0 (start): train 0.548 | val 0.550
Round 50: train 0.309 | val 0.317
Round 100: train 0.294 | val 0.304
Round 250: train 0.284 | val 0.300
Round 500: train 0.278 | val 0.301
Round 999 (final): train 0.271 | val 0.302
Training time: 130 seconds (billable)
Overfitting signal: val loss plateaus at ~round 250, then drifts up

How to read this: Lower logloss = better predictions. The gap between train and validation widens after round 250 — the model starts memorizing training noise rather than learning generalizable patterns. Lab 4's hyperparameter tuning (with early stopping) would catch this and stop training earlier.
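If you prefer to pull these numbers programmatically rather than scrolling the console logs, the SDK's TrainingJobAnalytics helper reads the metrics SageMaker publishes to CloudWatch. A sketch, assuming the estimator from the earlier example; the exact metric names follow the channel names, so confirm them in the training job's metric definitions.

```python
# Hedged sketch: pull the per-round logloss curves for a finished training job.
from sagemaker.analytics import TrainingJobAnalytics

job_name = estimator.latest_training_job.name  # or paste the job name from the console

metrics = TrainingJobAnalytics(
    training_job_name=job_name,
    metric_names=["train:logloss", "validation:logloss"],  # names follow the channel names
).dataframe()

# Compare the two curves side by side to spot where validation stops improving
curves = metrics.pivot_table(index="timestamp", columns="metric_name", values="value")
print(curves.tail())
```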

Confusion Matrix — The Foundation

For binary classification (income ≤$50K = 1, >$50K = 0), the confusion matrix shows four outcomes:

True Positive (TP)

Predicted ≤$50K
Actually ≤$50K
Correct identification

False Positive (FP)

Predicted ≤$50K
Actually >$50K
Wasted outreach

⚠️

False Negative (FN)

Predicted >$50K
Actually ≤$50K
Missed person in need

True Negative (TN)

Predicted >$50K
Actually >$50K
Correct exclusion

Classification Metrics

Metric | Formula | What It Measures | HCM Interpretation
Precision | TP / (TP + FP) | Of all predicted positives, how many are actually positive? | "Of employees we flagged as flight risks, how many actually left?" High precision = fewer false alarms.
Recall | TP / (TP + FN) | Of all actual positives, how many did we catch? | "Of all employees who left, how many did we identify beforehand?" High recall = fewer missed departures.
F1-Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall; balances both. | The target metric for Lab 3. Balances "don't cry wolf" (precision) with "don't miss anyone" (recall).
Accuracy | (TP + TN) / Total | Overall fraction of correct predictions. | Can be misleading with imbalanced classes: if 75% stay, always predicting "stay" gives 75% accuracy.
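If the Debugger report is unavailable, the same metrics take a few lines to compute once you have predictions (as you will in Lab 5). A sketch with placeholder labels and predictions, using scikit-learn.

```python
# Minimal sketch of the four metrics from the table above. y_true / y_pred are
# placeholders for test labels and 0/1 predictions (probabilities thresholded at 0.5).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (placeholder)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (placeholder)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # 2PR / (P + R)
print("accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / total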

Feature Importance — What Drives Predictions?

The XGBoost report ranks features by how often they're used in tree splits. In the lab dataset (34 features after encoding), the top predictors typically include:

💰

capital_gain (f4)

Strong income signal — investment income correlates heavily with earning >$50K

👨‍👩‍👧

marital_status (f7)

Married-civ-spouse is a top predictor — household economics affect income bracket

🎓

education_num (f2)

Years of education directly correlates with earning potential

hours_per_week (f6)

Working hours signal full-time vs part-time employment patterns
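If the Debugger report fails (see the known issue above), you can still inspect the ranking behind the feature cards yourself by downloading the model artifact and asking the booster for its split counts. A hedged sketch, assuming the estimator from the earlier example; the file name inside model.tar.gz ("xgboost-model") is typical for the built-in algorithm but may differ by container version.

```python
# Hedged sketch: download model.tar.gz, load the booster, print top features.
import tarfile
import xgboost as xgb
from sagemaker.s3 import S3Downloader

S3Downloader.download(estimator.model_data, ".")   # fetch model.tar.gz locally
with tarfile.open("model.tar.gz") as tar:
    tar.extractall()

booster = xgb.Booster()
booster.load_model("xgboost-model")                # typical file name; may vary by version

# "weight" counts how often each feature (f0, f1, ...) is used in a split
importance = booster.get_score(importance_type="weight")
for feature, score in sorted(importance.items(), key=lambda kv: -kv[1])[:5]:
    print(feature, score)
```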

💡
AnyCompany parallel: For an attrition model, you'd expect top features to be months_since_promotion, engagement_score, manager_changes_2yr, and salary_percentile. If exit_interview_scheduled ranks #1, that's target leakage — the feature only exists because the person already decided to leave (covered in Lab 1).

🏢 HCM Mapping — AnyCompany Attrition Model

Every step in Lab 3 maps directly to training an employee attrition prediction model at AnyCompany. Here's how you'd apply the same pattern to predict which employees are likely to leave within the next 12 months.

Lab 3 → AnyCompany Translation

Lab 3 (Census Income) | AnyCompany (Attrition) | Key Difference
Target: income (≤$50K = 1) | Target: left_company (left = 1) | Same binary classification pattern
34 features (age, education, occupation...) | 19 features (tenure, engagement, salary...) | AnyCompany has fewer but more domain-specific features
~22K training rows | ~1.4K training rows (from 2,000 synthetic) | A real AnyCompany dataset would have 100K+ rows from HRIS
70/10/20 split (train/val/test) | Same split ratios | Time-based split preferred for attrition (train on past, test on recent)
XGBoost binary:logistic | XGBoost binary:logistic | Same algorithm; XGBoost excels on tabular HR data
F1-Score as target metric | Recall prioritized (don't miss leavers) | Missing a flight risk is costlier than a false alarm at AnyCompany
ml.m5.xlarge (single instance) | ml.m5.xlarge sufficient for <1M rows | Scale to ml.m5.4xlarge for the full AnyCompany dataset

AnyCompany Training Configuration

Estimator Config — Attrition Model
Algorithm: XGBoost 1.5-1
Objective: binary:logistic
Target Column: left_company (column 0)
Key Features: engagement_score, months_since_promotion, salary_percentile
Dropped (Leakage): exit_interview_scheduled, resignation_notice_days
Dropped (Useless): badge_color, employee_id
Instance: ml.m5.xlarge (4 vCPU, 16 GB)
Hyperparameters: Same as Lab 3 (tuned in Lab 4)

Expected Feature Importance (AnyCompany)

📈

engagement_score

Strongest predictor. Low engagement (1–3) correlates with 4× higher attrition. Equivalent to capital_gain in the lab.

months_since_promotion

Employees stuck >24 months without promotion show elevated flight risk. Time-based feature with clear business logic.

💵

salary_percentile

Below-market compensation (percentile <30) drives departures. Especially impactful for IC3–IC5 levels.

👥

manager_changes_2yr

Frequent manager changes (≥3 in 2 years) signal organizational instability. Moderate predictor with clear intervention path.

🚀
Production consideration: At AnyCompany scale (millions of employees across clients), you'd add a time-based train/test split — train on data from months 1–18, validate on months 19–21, test on months 22–24. This prevents future information from leaking into training and gives a realistic estimate of how the model performs on truly unseen future data.
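A minimal sketch of that split, assuming a hypothetical snapshot_month column (1–24) in the feature table; the column name, month ranges, and file path are illustrative.

```python
# Hedged sketch of a time-based train/validation/test split for attrition data.
import pandas as pd

df = pd.read_csv("attrition_features.csv")            # placeholder path

train_df = df[df["snapshot_month"] <= 18]              # months 1-18: train on the past
val_df = df[df["snapshot_month"].between(19, 21)]      # months 19-21: validate on more recent data
test_df = df[df["snapshot_month"] >= 22]               # months 22-24: truly "future" holdout

print(len(train_df), len(val_df), len(test_df))
```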