Lab 3 — Interactive Explainer
Configure an XGBoost estimator, set hyperparameters, launch a managed training job, and evaluate model performance using the Debugger XGBoost report.
In this lab you take the processed dataset from Labs 1 & 2 and train a binary classification model using Amazon SageMaker's built-in XGBoost algorithm. You configure an Estimator, set hyperparameters, launch a managed training job on dedicated ML instances, and evaluate the results through the auto-generated Debugger XGBoost report.
Duration: ~25 minutes • Phase: Model Training • Prerequisite: Labs 1–2 (processed CSV in S3)
Pipeline at a glance:
- Inputs: Train CSV (70%) and Validation CSV (10%); Test CSV (20%) is held for Lab 5
- Algorithm: built-in XGBoost 1.5-1
- Compute: ml.m5.xlarge, 1000 boosting rounds
- Output: model.tar.gz in S3, plus the XGBoost Debugger Report
Why 3 splits? Train (70%) = model learns patterns. Validation (10%) = monitors overfitting during training (logloss printed each round). Test (20%) = final evaluation after deployment (Lab 5) on data the model has never seen.
The validation set is not used to teach the model — it's a "practice exam" the model takes after each boosting round to check if it's actually learning generalizable patterns or just memorizing the training data.
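A minimal sketch of the 70/10/20 split, assuming the processed dataset from Labs 1–2 is available locally as processed.csv (a hypothetical file name) with the label already in the first column:

```python
import pandas as pd

# Processed dataset from Labs 1-2 (hypothetical local file name)
df = pd.read_csv("processed.csv")

# Shuffle once, then carve out 70% train / 10% validation / 20% test
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
train_df = shuffled.iloc[: int(0.7 * n)]
val_df = shuffled.iloc[int(0.7 * n): int(0.8 * n)]
test_df = shuffled.iloc[int(0.8 * n):]            # held back until Lab 5

# SageMaker's built-in XGBoost reads headerless CSVs with the label in the first column
train_df.to_csv("train.csv", index=False, header=False)
val_df.to_csv("validation.csv", index=False, header=False)
test_df.to_csv("test.csv", index=False, header=False)
```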
| Step | What Happens | Analogy |
|---|---|---|
| 1. Train | XGBoost builds a new tree using the train data (70%). The model's internal weights update. | Student studies a chapter |
| 2. Validate | XGBoost predicts on validation data (10%) — without learning from it. Computes logloss. | Student takes a practice quiz (no peeking at answers) |
| 3. Log | Both metrics printed: [50] train-logloss:0.309 val-logloss:0.317 | Teacher records both scores |
| 4. Repeat | Steps 1–3 repeat for all 1000 rounds (num_round) | 1000 study sessions with practice quizzes |
| Loss Curve Pattern | What It Means |
|---|---|
| Train ↓ & Val ↓ | Both improving together |
| Train ↓ but Val → or ↑ | Memorizing, not learning |
| Both high & flat | Model too simple |
In Lab 3, we saw overfitting start at around round 250 (val-logloss plateaued at 0.300 while train-logloss kept dropping to 0.271). Lab 4 fixes this with early stopping, which automatically halts training when validation stops improving.
The test set (20%) is the "final exam" — used only once after training is completely done (Lab 5) to get an honest, unbiased score. If you used it during training, you'd be "teaching to the test" and your real-world performance estimate would be too optimistic.
- Estimator: High-level Python SDK interface that wraps container image, instance type, IAM role, and output path into a single object you call .fit() on.
- XGBoost algorithm: Gradient boosted trees, an ensemble of weak decision trees trained sequentially, each correcting the errors of the previous one. Ideal for tabular classification.
- Hyperparameters: Knobs that control training behavior (tree depth, learning rate, regularization). Set before training, not learned from data. Tuned in Lab 4.
- Debugger XGBoost report: Auto-generated notebook with loss curves, confusion matrix, feature importance, and classification metrics, with no extra code needed.
For AnyCompany, the same setup applies with left_company as the target instead of income.

A SageMaker training job orchestrates multiple AWS services behind the scenes. Your notebook code simply calls .fit(); SageMaker handles provisioning compute, pulling the container image, downloading data from S3, running the algorithm, and uploading the model artifact back to S3.
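A minimal sketch of that launch, assuming the Labs 1–2 CSVs have been uploaded to the session's default bucket under a lab3/ prefix (the prefix and file names are hypothetical):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import Rule, rule_configs

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Built-in XGBoost 1.5-1 container image for the notebook's region
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.5-1"
)

xgb_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/lab3/output",                     # model.tar.gz lands here
    sagemaker_session=session,
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],  # Debugger XGBoost report
)

# The seven hyperparameters discussed below
xgb_estimator.set_hyperparameters(
    max_depth=5,
    eta=0.1,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    objective="binary:logistic",
    num_round=1000,
)

# Built-in XGBoost expects headerless, label-first CSVs in the train/validation channels
xgb_estimator.fit(
    {
        "train": TrainingInput(f"s3://{bucket}/lab3/train.csv", content_type="text/csv"),
        "validation": TrainingInput(f"s3://{bucket}/lab3/validation.csv", content_type="text/csv"),
    }
)
```

The train and validation channel names are what the built-in algorithm expects; the rules entry is what triggers the Debugger report described next.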
The CreateXgboostReport rule launches a separate processing job (on ml.t3.medium) that reads tensor snapshots saved every 5 rounds. It uses the sagemaker-debugger-rules:latest container from ECR to generate the report notebook. The training job itself always succeeds independently; the Debugger is an optional add-on for visualization.

XGBoost (eXtreme Gradient Boosting) is a supervised learning algorithm that builds an ensemble of decision trees sequentially. Each new tree focuses on correcting the mistakes of the previous ensemble, gradually reducing prediction error through gradient descent in function space.
| Strength | How It Helps | HCM Relevance |
|---|---|---|
| Handles mixed features | Works with both ordinal-encoded and one-hot-encoded columns without normalization | Employee data mixes numeric (tenure, salary) with categorical (department, level) |
| Built-in regularization | gamma, min_child_weight, and subsample prevent overfitting on noisy data | HR data is noisy — engagement surveys have measurement error |
| Feature importance | Automatically ranks which features drive predictions | Tells you whether tenure or manager changes matter more for attrition |
| Fast on CPU | Trains in ~2 minutes on ml.m5.xlarge for 22K rows × 34 features | Most HCM datasets are tabular and fit comfortably on CPU instances |
| Binary classification | binary:logistic objective outputs probability [0,1] | Attrition is binary: left_company = 1 or 0 |
Hyperparameters are configuration values set before training begins — they control how the algorithm learns, not what it learns. The lab uses 7 key hyperparameters for XGBoost. Getting these right is the difference between a model that generalizes well and one that memorizes training noise.
| Parameter | Value | What It Controls | AnyCompany Analogy |
|---|---|---|---|
| max_depth | 5 | Maximum depth of each decision tree. Deeper = more complex patterns but higher overfitting risk. | Like limiting how many "if-then" rules you chain: "IF tenure < 2 AND engagement < 3 AND no promotion AND remote AND ..." |
| eta | 0.1 | Learning rate — how much each new tree contributes. Lower = slower but more precise convergence. | Taking small careful steps when adjusting attrition risk scores rather than large jumps |
| gamma | 4 | Minimum loss reduction to make a split. Higher = more conservative tree growth (regularization). | Only split a node if it meaningfully separates leavers from stayers — ignore trivial differences |
| min_child_weight | 6 | Minimum sum of instance weights in a leaf. Prevents trees from creating rules based on too few examples. | Don't create a rule from just 3 employees — need at least 6 data points to justify a pattern |
| subsample | 0.7 | Fraction of training data used per tree. Adds randomness to prevent overfitting (like bagging). | Each tree sees 70% of employees — different random subsets — so no single outlier dominates |
| objective | binary:logistic | Loss function for binary classification. Outputs probability between 0 and 1. | Predicting probability of attrition: 0.82 means 82% likely to leave within 12 months |
| num_round | 1000 | Number of boosting iterations (trees to build). More rounds with low eta = better generalization. | Building 1000 small correction trees — each one slightly improves the attrition prediction |
gamma + min_child_weight + max_depth work together to prevent overfitting. High gamma prunes weak splits, min_child_weight requires sufficient evidence, and max_depth caps complexity.
eta × num_round = total learning capacity. Low eta (0.1) with high rounds (1000) gives fine-grained convergence. High eta (0.3) with fewer rounds trains faster but may overshoot.
subsample = 0.7 means each tree only sees 70% of rows. This stochastic element reduces variance and makes the ensemble more robust to noisy HR survey data.

Lab 4's hyperparameter tuning searches ranges such as max_depth: [3, 10] and eta: [0.01, 0.3] across multiple training jobs in parallel; a sketch of that setup follows below.
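A sketch of that Lab 4 tuner, reusing xgb_estimator, TrainingInput, and bucket from the launch sketch above; the objective metric name and job counts here are illustrative assumptions:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Search the ranges mentioned above across parallel training jobs.
# "xgb_estimator" and "bucket" come from the earlier launch sketch;
# the metric name, max_jobs, and max_parallel_jobs are illustrative assumptions.
tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:logloss",
    objective_type="Minimize",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 10),
        "eta": ContinuousParameter(0.01, 0.3),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit(
    {
        "train": TrainingInput(f"s3://{bucket}/lab3/train.csv", content_type="text/csv"),
        "validation": TrainingInput(f"s3://{bucket}/lab3/validation.csv", content_type="text/csv"),
    }
)
```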
After training completes, SageMaker Debugger automatically generates an XGBoost report containing loss curves, a confusion matrix, classification metrics, and feature importance rankings. This section breaks down how to interpret each component.
The CreateXgboostReport rule runs as a separate processing job that can fail with "AlgorithmError: Algorithm container exited with error." This is a platform issue (the Debugger rules container image is incompatible with newer environments), not something you did wrong. If this happens, skip the waiter cell and proceed to Task 2.6. The training itself completed successfully, and you can read the metrics directly from the training logs printed above (train-logloss and validation-logloss at each round).

Even without the Debugger report, the training output prints logloss at every round in lines like [50] train-logloss:0.309 validation-logloss:0.317.
How to read this: Lower logloss = better predictions. The gap between train and validation widens after round 250 — the model starts memorizing training noise rather than learning generalizable patterns. Lab 4's hyperparameter tuning (with early stopping) would catch this and stop training earlier.
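If the report never generates, the same per-round metrics can be pulled programmatically. A minimal sketch, assuming xgb_estimator from the earlier launch sketch has finished training and that the built-in algorithm emits metrics named train:logloss and validation:logloss:

```python
from sagemaker.analytics import TrainingJobAnalytics

# Pull the per-round logloss values SageMaker captured from the training logs.
# "xgb_estimator" is the finished Estimator from the earlier sketch.
metrics = TrainingJobAnalytics(
    training_job_name=xgb_estimator.latest_training_job.job_name,
    metric_names=["train:logloss", "validation:logloss"],
).dataframe()

# One row per emitted value: timestamp, metric_name, value
print(metrics.head())

# Compare the last recorded train vs validation logloss to spot an overfitting gap
print(metrics.groupby("metric_name")["value"].last())
```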
For binary classification (income ≤$50K = 1, >$50K = 0), the confusion matrix shows four outcomes:
| | Actually ≤$50K | Actually >$50K |
|---|---|---|
| Predicted ≤$50K | True Positive: correct identification | False Positive: wasted outreach |
| Predicted >$50K | False Negative: missed person in need | True Negative: correct exclusion |
| Metric | Formula | What It Measures | HCM Interpretation |
|---|---|---|---|
| Precision | TP / (TP + FP) | Of all predicted positives, how many are actually positive? | "Of employees we flagged as flight risks, how many actually left?" High precision = fewer false alarms. |
| Recall | TP / (TP + FN) | Of all actual positives, how many did we catch? | "Of all employees who left, how many did we identify beforehand?" High recall = fewer missed departures. |
| F1-Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall — balances both. | The target metric for Lab 3. Balances "don't cry wolf" (precision) with "don't miss anyone" (recall). |
| Accuracy | (TP + TN) / Total | Overall correct predictions. | Can be misleading with imbalanced classes — if 75% stay, predicting "stay" always gives 75% accuracy. |
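A minimal sketch of these four formulas in code; the confusion-matrix counts below are made-up placeholders, not results from the lab:

```python
# Hypothetical confusion-matrix counts (placeholders, not lab results)
tp, fp, fn, tn = 950, 210, 180, 3100

precision = tp / (tp + fp)                          # of flagged positives, how many were right
recall = tp / (tp + fn)                             # of actual positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```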
The XGBoost report ranks features by how often they're used in tree splits. In the lab dataset (34 features after encoding), the top predictors typically include the following; a toy sketch of split-count importance appears after the list:
- capital_gain: Strong income signal; investment income correlates heavily with earning >$50K
- marital_status: Married-civ-spouse is a top predictor; household economics affect income bracket
- education_num: Years of education directly correlates with earning potential
- hours_per_week: Working hours signal full-time vs part-time employment patterns
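The sketch below shows split-count ("weight") importance with the open-source xgboost library on synthetic stand-in data, not the lab dataset:

```python
import numpy as np
import xgboost

# Synthetic stand-in data: 5 features, binary label driven mostly by the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgboost.DMatrix(X, label=y, feature_names=[f"feature_{i}" for i in range(5)])
booster = xgboost.train(
    {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1},
    dtrain,
    num_boost_round=50,
)

# importance_type="weight" counts how many splits use each feature,
# the same ranking idea the Debugger report visualizes
importance = booster.get_score(importance_type="weight")
for name, splits in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(name, splits)
```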
For AnyCompany's attrition model, the equivalent top predictors would be months_since_promotion, engagement_score, manager_changes_2yr, and salary_percentile. If exit_interview_scheduled ranks #1, that's target leakage: the feature only exists because the person already decided to leave (covered in Lab 1).

Every step in Lab 3 maps directly to training an employee attrition prediction model at AnyCompany. Here's how you'd apply the same pattern to predict which employees are likely to leave within the next 12 months.
| Lab 3 (Census Income) | AnyCompany (Attrition) | Key Difference |
|---|---|---|
| Target: income (≤$50K = 1) | Target: left_company (left = 1) | Same binary classification pattern |
| 34 features (age, education, occupation...) | 19 features (tenure, engagement, salary...) | AnyCompany has fewer but more domain-specific features |
| ~22K training rows | ~1.4K training rows (from 2000 synthetic) | Real AnyCompany would have 100K+ from HRIS |
| 70/10/20 split (train/val/test) | Same split ratios | Time-based split preferred for attrition (train on past, test on recent) |
| XGBoost binary:logistic | XGBoost binary:logistic | Same algorithm — XGBoost excels on tabular HR data |
| F1-Score as target metric | Recall prioritized (don't miss leavers) | Missing a flight risk is costlier than a false alarm at AnyCompany |
| ml.m5.xlarge (single instance) | ml.m5.xlarge sufficient for <1M rows | Scale to ml.m5.4xlarge for full AnyCompany dataset |
- engagement_score: Strongest predictor. Low engagement (1–3) correlates with 4× higher attrition. Equivalent to capital_gain in the lab.
- months_since_promotion: Employees stuck >24 months without promotion show elevated flight risk. Time-based feature with clear business logic.
- salary_percentile: Below-market compensation (percentile <30) drives departures. Especially impactful for IC3–IC5 levels.
- manager_changes_2yr: Frequent manager changes (≥3 in 2 years) signal organizational instability. Moderate predictor with clear intervention path.