Module 7 - Interactive Explainer
Master evaluation metrics, understand bias-variance trade-offs, reduce training time with distributed computing, and optimize hyperparameters for production HCM models.
Every ML model has two sources of prediction error: bias (systematic inaccuracy) and variance (sensitivity to training data). The ideal model minimizes both. Understanding this trade-off is essential for diagnosing why a model underperforms and choosing the right fix.
Low bias, low variance: Accurate AND consistent. Predictions cluster tightly around the true value. The model has learned real patterns and generalizes well to unseen data.
High bias, low variance: Consistently wrong in the same direction. Underfitting. Like predicting every employee has 10% attrition risk regardless of their actual situation. Tightly clustered but off-target.
Low bias, high variance: Correct on average but wildly inconsistent. Overfitting. The model gives different predictions for similar employees depending on which training sample it saw. Centered but scattered.
High bias, high variance: Wrong AND inconsistent. The worst case. The model is both too simple to capture patterns and too sensitive to noise. Start over with better features or a different algorithm.
You detect bias and variance by comparing training set performance against validation set performance. The gap between them tells you which problem dominates.
| Problem | How to Detect | Common Causes | Fixes |
|---|---|---|---|
| High Bias (Underfitting) | Low accuracy on BOTH train and validation sets | Model too simple, wrong features, insufficient training, inherited bias from dataset | Add features, use more complex model, train longer, fix feature engineering |
| High Variance (Overfitting) | High train accuracy, LOW validation accuracy (big gap) | Model too complex, too many irrelevant features, trained too long on training data | Regularization (L1/L2), reduce features, early stopping, more diverse training data |
Underfitting (high bias): Using only tenure to predict attrition. Train accuracy: 62%, Val accuracy: 61%. The model is too simple; it needs compensation, performance, and management features.
Overfitting (high variance): Using 200 features including employee ID patterns. Train accuracy: 98%, Val accuracy: 72%. The 26% gap reveals memorization instead of learning.
Good fit: 25 curated features (tenure, salary percentile, performance, manager tenure). Train: 87%, Val: 85%. Small gap = good generalization to unseen employees.
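The diagnosis logic above is easy to automate. Below is a minimal sketch that compares train and validation accuracy for trees of increasing depth and labels the dominant problem; the synthetic data and the cutoff values (0.75, 0.10) are illustrative assumptions, not the lab's dataset or thresholds.

```python
# Minimal sketch: diagnose under- vs overfitting from the train/validation
# gap. Synthetic data and thresholds are illustrative, not the lab's values.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=25, n_informative=10,
                           flip_y=0.05, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=42)

for depth in (1, 6, None):  # shallow, moderate, unconstrained
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    gap = train_acc - val_acc
    if train_acc < 0.75 and val_acc < 0.75:
        diagnosis = "high bias (underfitting)"    # low on BOTH sets
    elif gap > 0.10:
        diagnosis = "high variance (overfitting)"  # big train/val gap
    else:
        diagnosis = "reasonable fit"
    print(f"depth={depth}: train={train_acc:.2f} val={val_acc:.2f} -> {diagnosis}")
```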
Before tuning, establish a reference point. Three best practices from the course material:
Choose metrics aligned with business objectives BEFORE training. For fraud: recall. For attrition alerts: precision. For salary: RMSE. The metric defines what "good" means.
Train a basic model first (linear regression, simple decision tree). This baseline reveals problem complexity and sets a floor. If the simple model already scores 80%, you know the problem is tractable and tuning only has to improve on that floor.
Evaluation datasets must represent production data distribution. No leakage, no bias, no synthetic shortcuts. At AnyCompany: use recent employee data, not historical snapshots from a different era.
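A minimal sketch tying the three practices together: pick the metric first, train a deliberately simple model, and score it on a held-out validation split. The imbalanced synthetic data here is an assumed stand-in for real HCM records.

```python
# Minimal baseline sketch: metric chosen up front (recall, for a
# fraud-style problem), deliberately simple model, honest held-out split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                  random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Whatever a tuned model achieves later is judged against this floor.
print("baseline recall:", recall_score(y_val, baseline.predict(X_val)))
```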
For classification models (fraud/not fraud, stay/leave), accuracy alone is misleading, especially with imbalanced targets. If only 0.2% of transactions are fraudulent, a model that always predicts "not fraud" achieves 99.8% accuracy while catching zero fraud. You need precision, recall, and F1 to understand WHERE your model fails.
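A few lines make the paradox concrete; the counts mirror the 0.2% fraud rate described above.

```python
# The accuracy paradox: 0.2% fraud, model always says "not fraud"
# -> 99.8% accurate, 0% of fraud caught.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1                   # 0.2% fraudulent transactions
y_pred = np.zeros_like(y_true)    # always predicts "not fraud"

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.998
print("recall:", recall_score(y_true, y_pred))      # 0.0
```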
For binary classification, every prediction falls into one of four outcomes, shown in the confusion matrix below. For multiclass problems with n classes, the matrix becomes n×n.
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) = 1620: correctly identified fraud | False Negative (FN) = 128: missed fraud (dangerous) |
| Actual: Negative | False Positive (FP) = 199: false alarm (annoying) | True Negative (TN) = 423: correctly cleared |
| Metric | Formula | Value | What It Tells You | HCM Priority |
|---|---|---|---|---|
| Accuracy | (TP+TN) / Total | 86.2% | Overall correctness | Misleading when classes are imbalanced (fraud is rare) |
| Precision | TP / (TP+FP) | 89.0% | When model says "fraud," how often is it right? | High precision = fewer false alarms for compliance team |
| Recall | TP / (TP+FN) | 92.7% | Of all actual fraud, how much did we catch? | High recall = fewer missed fraud cases (critical for payroll) |
| F1 Score | 2*(P*R)/(P+R) | 90.8% | Harmonic mean of precision and recall | Balanced metric when both false alarms and misses matter |
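The table's values can be reproduced directly from the four counts; a quick sanity check in Python:

```python
# Sanity check: recompute the table's metrics from the four counts above.
TP, FN, FP, TN = 1620, 128, 199, 423
total = TP + FN + FP + TN

accuracy = (TP + TN) / total                         # ~0.862
precision = TP / (TP + FP)                           # ~0.89
recall = TP / (TP + FN)                              # ~0.93
f1 = 2 * precision * recall / (precision + recall)   # ~0.91

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```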
Classification models don't output a hard yes/no โ they output a probability between 0.0 and 1.0. The threshold you choose converts that probability into a decision. ROC curves plot the True Positive Rate vs False Positive Rate across ALL possible thresholds, revealing the model's discrimination ability at every operating point.
AUC = 1.0: The model perfectly separates all positives from negatives at every threshold. Unrealistic in practice but the theoretical ideal.
High AUC (roughly 0.8-0.9): Strong discrimination ability. At most thresholds, the model correctly ranks positives above negatives. Production-ready for most HCM use cases.
Moderate AUC (around 0.7): Decent but room for improvement. May need better features or a more complex model. Acceptable for low-stakes predictions.
AUC = 0.5: No better than flipping a coin. The model has learned nothing useful. Go back to feature engineering or try a different algorithm.
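Computing AUC and the ROC curve points is a one-liner with scikit-learn; in this sketch the scores are synthetic stand-ins (an assumption, not model output) in which positives tend to score higher than negatives.

```python
# Sketch: AUC and ROC points with scikit-learn on synthetic scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=1000), 0, 1)

print("AUC:", roc_auc_score(y_true, y_score))
# One (false positive rate, true positive rate) pair per threshold:
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```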
| Threshold | Effect | When to Use | AnyCompany Example |
|---|---|---|---|
| Low (0.2) | High recall, low precision (catch everything, many false alarms) | Cost of missing a positive is very high | Payroll fraud: flag anything suspicious, review manually |
| Medium (0.5) | Balanced precision and recall | Default starting point, equal cost of errors | Support ticket routing: balanced accuracy across categories |
| High (0.8) | High precision, low recall (only flag when very confident) | Cost of false alarm is high | Attrition alerts to executives: only flag when very certain |
Payroll Fraud Model: Threshold = 0.35 (aggressive). Rather flag 200 legitimate transactions for review than miss 1 actual fraud case worth $50K+.
Attrition Alert Model: Threshold = 0.75 (conservative). Only alert managers when model is 75%+ confident. False alarms erode trust in the system.
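The trade-off is easy to see in code. The sketch below reuses the same kind of synthetic scores as the ROC example; the exact numbers are illustrative, but the direction holds: a low cutoff buys recall at the cost of precision, and a high cutoff does the reverse.

```python
# Sketch: one set of probabilities, two business thresholds.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=1000), 0, 1)

for threshold in (0.35, 0.75):  # aggressive (fraud) vs conservative (attrition)
    y_pred = (y_score >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_true, y_pred):.2f} "
          f"recall={recall_score(y_true, y_pred):.2f}")
```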
Large models and massive datasets can take days to train on a single machine. ML models are becoming increasingly complex, and reducing training time supports faster iteration, lower costs, and quicker time-to-production. Two key techniques: early stopping (stop when done) and distributed training (use more machines).
Monitor the validation metric each epoch. If it stops improving for N consecutive epochs (the patience), halt training. This saves compute AND prevents overfitting: early stopping acts as a regularization technique that preserves generalization ability.
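As a concrete illustration, open-source XGBoost exposes exactly this patience mechanism via early_stopping_rounds; a minimal sketch on synthetic stand-in data:

```python
# Patience-based early stopping with XGBoost: training halts once
# validation AUC fails to improve for 10 consecutive rounds.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

booster = xgb.train(
    {"objective": "binary:logistic", "eval_metric": "auc", "eta": 0.1},
    dtrain,
    num_boost_round=500,        # upper bound; early stopping cuts it short
    evals=[(dval, "validation")],
    early_stopping_rounds=10,   # the "patience" window
)
print("best round:", booster.best_iteration)
```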
SageMaker evaluates the objective metric after each epoch, calculates the median of running averages from previous jobs, and stops the job if the current metric is worse than that median. Supported for XGBoost, Linear Learner, CatBoost, and custom algorithms that emit metrics.
Set TrainingJobEarlyStoppingType to AUTO via Boto3, or early_stopping_type='Auto' in the SageMaker Python SDK. Choose stopping criteria carefully: too aggressive and you underfit, too lenient and you waste compute.

Distributed training splits work across multiple instances to train faster. The right approach depends on whether your bottleneck is data size (too many samples) or model size (too many parameters to fit in one GPU's memory).
| Approach | How It Works | When to Use | AnyCompany Example |
|---|---|---|---|
| Data Parallelism | Same model replicated on each GPU. Different data subsets per device. Gradients averaged across instances after each batch. | Large datasets, model fits in single GPU memory | Training image classifier on millions of document scans (SMDDP library) |
| Model Parallelism | Model split across GPUs. Each device holds part of the parameters. Tensor parallelism splits large matrix multiplications. | Model too large for single GPU (billions of parameters) | Fine-tuning LLM for AnyCompany Assist (SageMaker model parallelism library) |
| Hybrid Parallel | Both data and model split across instances. Massive datasets distributed + model partitioned. | Massive model + massive dataset | Training foundation model on all AnyCompany support transcripts |
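For the data-parallel row, here is a hedged sketch of what enabling SMDDP looks like with the SageMaker Python SDK; the script name, role ARN, instance choice, framework versions, and S3 paths are all placeholders, not values from the course.

```python
# Hedged sketch: enabling the SMDDP data-parallel library on a SageMaker
# PyTorch estimator. All names and paths below are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",      # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=2,            # data subsets split across instances
    instance_type="ml.p4d.24xlarge",  # a GPU type SMDDP supports
    framework_version="2.0",
    py_version="py310",
    # Same model replicated per GPU; gradients averaged after each batch:
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://anycompany-training-data/")  # placeholder bucket
```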
Self-healing clusters that automatically recover from hardware failures. Can reduce training time by up to 20% by avoiding full restarts on transient errors.
SMDDP (data parallelism) and SMP (model parallelism) libraries improve performance by up to 20%. Optimized AllReduce for gradient synchronization across instances.
For the largest training jobs requiring full control over compute environment and workload scheduling. Optimized resource utilization for foundation model training.
Hyperparameters are algorithm-specific knobs that control the training process. Applying different values with the same data yields different model variants. Finding optimal hyperparameters adjusts the bias-variance trade-off, protects against overfitting, and controls time and cost. Traditionally done manually by domain experts โ now automated.
Lab 4 uses XGBoost. Its hyperparameters fall into three categories:
| Category | Key Hyperparameters | What They Control |
|---|---|---|
| General | booster (gbtree, gblinear, dart) | Type of base algorithm added each iteration |
| Booster-Specific | num_round, max_depth, min_child_weight, alpha | Number of trees, tree complexity, regularization |
| Learning Task | objective, eta (learning rate), eval_metric | Loss function, step size, evaluation criteria |
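Grouped as a single parameter dict, the three categories look like this; the values are illustrative starting points, not the lab's tuned settings.

```python
# The three hyperparameter categories as one XGBoost parameter dict.
params = {
    # General: the base algorithm added each boosting iteration
    "booster": "gbtree",
    # Booster-specific: tree complexity and regularization
    "max_depth": 6,
    "min_child_weight": 1,
    "alpha": 0.0,                      # L1 regularization strength
    # Learning task: loss function, step size, evaluation criteria
    "objective": "binary:logistic",
    "eta": 0.1,                        # learning rate
    "eval_metric": "auc",
}
# num_round is passed separately (num_boost_round in xgb.train).
```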
Grid search: Try every combination in a predefined grid. Exhaustive but exponentially expensive. Only practical with few hyperparameters and small ranges. Guaranteed to find the best value in the grid.
Random search: Sample random combinations from ranges. Surprisingly effective: it often finds good values faster than grid search because it explores more of the space. A good default choice. Risk: the best set could be missed.
Bayesian optimization: Uses probabilistic models (Gaussian processes) to predict which configurations are likely to perform well. Learns from previous trials. Most sample-efficient for expensive training jobs. Converges quickly but is complex to scale.
Hyperband: Two phases: exploration (train many configs briefly, stop when improvement plateaus), then exploitation (drop the worst 50%, allocate more epochs to the survivors). Repeats until a final configuration remains. Best for iterative algorithms (neural networks).
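Grid and random search are easy to contrast locally with scikit-learn before reaching for managed tuning; a minimal sketch on synthetic data, where both strategies evaluate nine configurations but random search samples freely from continuous ranges instead of a fixed 3x3 lattice.

```python
# Sketch: grid search vs random search with scikit-learn.
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]},  # 3x3 grid
    cv=3,
).fit(X, y)

rand_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"max_depth": randint(3, 10), "learning_rate": loguniform(0.01, 0.3)},
    n_iter=9, cv=3, random_state=0,   # same budget, broader coverage
).fit(X, y)

print("grid:  ", grid_search.best_params_, round(grid_search.best_score_, 3))
print("random:", rand_search.best_params_, round(rand_search.best_score_, 3))
```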
AMT uses ML to find the best hyperparameters automatically. You specify ranges and an objective โ AMT runs many training jobs and identifies the optimal configuration. Supports up to 30 hyperparameters with categorical, integer, or continuous ranges.
| Configuration | What You Specify | Example (XGBoost Attrition) |
|---|---|---|
| Tuning Strategy | Bayesian, Random, Hyperband, or Grid | Bayesian (most efficient for expensive jobs) |
| Hyperparameter Ranges | Min/max for each hyperparameter (static or dynamic) | max_depth: [3,10], eta: [0.01,0.3], n_estimators: [100,500] |
| Objective Metric | What to optimize (minimize or maximize) | Maximize validation:auc |
| Completion Criteria | When to stop: max jobs, max runtime, or target metric value | Max 20 jobs OR AUC reaches 0.93 |
| Parallel Jobs | How many to run simultaneously | 4 parallel (faster wall-clock time) |
| Early Stopping | Kill unpromising jobs before completion | AUTO (saves cost on bad configurations) |
Flow: SageMaker Studio notebook → define objective metric + tuning strategy → AMT launches N training jobs using the XGBoost container → each job outputs model artifacts to S3 + logs to CloudWatch → AMT identifies the best-performing configuration
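A hedged sketch of that flow with the SageMaker Python SDK's HyperparameterTuner; the role ARN, container version, and S3 paths are placeholders, while the ranges, strategy, job counts, and early stopping mirror the configuration table above.

```python
# Hedged sketch: AMT via the SageMaker Python SDK. Paths and ARNs are
# placeholders, not values from the course.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                             IntegerParameter)

session = sagemaker.Session()
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, "1.7-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://anycompany-models/attrition/",      # placeholder
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",   # built-in XGBoost metric
    objective_type="Maximize",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 10),
        "eta": ContinuousParameter(0.01, 0.3),
    },
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=4,
    early_stopping_type="Auto",   # kill unpromising jobs early
)
tuner.fit({"train": "s3://anycompany-data/train/",        # placeholders
           "validation": "s3://anycompany-data/validation/"})
```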
Model: XGBoost attrition classifier
Strategy: Bayesian optimization (20 jobs, 4 parallel)
Result: Best AUC improved from 0.85 (default params) to 0.91 (tuned: max_depth=7, eta=0.05, n_estimators=340)
Cost: 20 training jobs × $2 each = $40 total. A 6-point AUC improvement for $40 is excellent ROI.
Explore how different evaluation metrics apply to AnyCompany use cases. Select a scenario to see which metrics matter most and why.
Payroll fraud detection: Binary classification where missing fraud is catastrophic but false alarms are tolerable.
Attrition alerts: Binary classification where false alarms erode manager trust in the system.
Salary prediction: Regression where prediction error is measured in dollars; no confusion matrix needed.
Support ticket routing: Multiclass classification where each misroute adds delay and frustration.
For the payroll fraud scenario, the recommendations look like this:

| Aspect | Recommendation |
|---|---|
| Primary Metric | Recall (catch rate) |
| Secondary Metric | AUC-ROC (overall discrimination) |
| Threshold | Low (0.3): flag aggressively |
| Acceptable FP Rate | Up to 10% (review cost is low) |
| Target Performance | Recall > 95%, Precision > 70% |
| Business Impact | Each missed fraud = $50K+ loss. Each false alarm = $5 review cost. |
After this module, you should be able to define and interpret evaluation metrics, explain techniques to reduce training time, and describe how hyperparameter tuning affects model performance.
Low bias + low variance is ideal. Detect underfitting (low train + val) vs overfitting (high train, low val). Use validation sets and baselines to diagnose.
Accuracy, precision, recall, F1, AUC-ROC. Choose based on business cost of different error types. Threshold converts probability to decision.
Early stopping prevents overfitting and saves compute. Distributed training (data/model/hybrid parallel) scales to large datasets and models. SageMaker HyperPod for the largest jobs.
Grid (exhaustive), Random (efficient), Bayesian (intelligent), Hyperband (iterative). SageMaker AMT automates the search with completion criteria and early stopping.