Module 7 - Interactive Explainer

Evaluating & Tuning ML Models

Master evaluation metrics, understand bias-variance trade-offs, reduce training time with distributed computing, and optimize hyperparameters for production HCM models.

📝 Evaluation ⚡ Interactive 🏢 HCM Context 🧪 Lab 4

⚖️ Bias-Variance Trade-off

Every ML model has two sources of prediction error: bias (systematic inaccuracy) and variance (sensitivity to training data). The ideal model minimizes both. Understanding this trade-off is essential for diagnosing why a model underperforms and choosing the right fix.

🎯 Think of it like archery targets. Bias = how far your cluster of arrows is from the bullseye (systematic offset). Variance = how spread out your arrows are (inconsistency). The goal: a tight cluster centered on the bullseye (low bias + low variance).

The Four Quadrants

🎯 Low Bias + Low Variance (Ideal)

Accurate AND consistent. Predictions cluster tightly around the true value. The model has learned real patterns and generalizes well to unseen data.

📍 High Bias + Low Variance

Consistently wrong in the same direction. Underfitting. Like predicting every employee has 10% attrition risk regardless of their actual situation. Tightly clustered but off-target.

🎲 Low Bias + High Variance

Correct on average but wildly inconsistent. Overfitting. Model gives different predictions for similar employees depending on which training sample it saw. Centered but scattered.

💥 High Bias + High Variance

Wrong AND inconsistent. The worst case. Model is both too simple to capture patterns and too sensitive to noise. Start over with better features or algorithm.

🔍 Detection & Fixes

You detect bias and variance by comparing training set performance against validation set performance. The gap between them tells you which problem dominates.

Problem | How to Detect | Common Causes | Fixes
High Bias (Underfitting) | Low accuracy on BOTH train and validation sets | Model too simple, wrong features, insufficient training, inherited bias from dataset | Add features, use a more complex model, train longer, fix feature engineering
High Variance (Overfitting) | High train accuracy, LOW validation accuracy (big gap) | Model too complex, too many irrelevant features, trained too long on training data | Regularization (L1/L2), reduce features, early stopping, more diverse training data

AnyCompany Attrition Model Example

Underfitting (high bias): Using only tenure to predict attrition. Train accuracy: 62%, Val accuracy: 61%. Model is too simple; it needs compensation, performance, and management features.

Overfitting (high variance): Using 200 features including employee ID patterns. Train accuracy: 98%, Val accuracy: 72%. The 26% gap reveals memorization instead of learning.

Good fit: 25 curated features (tenure, salary percentile, performance, manager tenure). Train: 87%, Val: 85%. Small gap = good generalization to unseen employees.
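
A minimal sketch of this train/validation gap diagnosis in Python (scikit-learn, with a synthetic stand-in for the attrition data; the feature count mirrors the example above, and the diagnostic cut-offs are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an attrition dataset: 25 features, imbalanced binary target.
X, y = make_classification(n_samples=5000, n_features=25, weights=[0.85], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
gap = train_acc - val_acc
print(f"train={train_acc:.2f}  val={val_acc:.2f}  gap={gap:.2f}")

# Illustrative cut-offs only; pick values that make sense for your metric and domain.
if val_acc < 0.70 and gap < 0.05:
    print("Likely high bias (underfitting): add features or use a more complex model")
elif gap > 0.10:
    print("Likely high variance (overfitting): regularize, prune features, or stop earlier")
else:
    print("Reasonable fit: small train/validation gap and acceptable validation accuracy")
```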

📋 Building a Performance Baseline

Before tuning, establish a reference point. Three best practices from the course material:

📏 Establish Evaluation Metrics

Choose metrics aligned with business objectives BEFORE training. For fraud: recall. For attrition alerts: precision. For salary: RMSE. The metric defines what "good" means.

💻 Start with a Simple Model

Train a basic model first (linear regression, a simple decision tree). This baseline reveals the problem's complexity and sets a performance floor: if the simple model already scores 80%, any more complex model has to beat that (see the baseline sketch after these practices).

🌍 Use Real-World Data

Evaluation datasets must represent production data distribution. No leakage, no bias, no synthetic shortcuts. At AnyCompany: use recent employee data, not historical snapshots from a different era.
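
A minimal baseline sketch along these lines (scikit-learn; the majority-class dummy and default logistic regression are illustrative choices, not prescribed by the course):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ~15% attrition: the majority class dominates.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.85], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Floor 1: always predict the majority class ("nobody leaves").
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Floor 2: a simple linear model with default settings.
simple = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("majority-class baseline accuracy:", majority.score(X_val, y_val))
print("simple linear baseline accuracy: ", simple.score(X_val, y_val))
# Any tuned model should clearly beat both numbers on the metric chosen up front.
```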

📊 Classification Evaluation Metrics

For classification models (fraud/not fraud, stay/leave), accuracy alone is misleading, especially with imbalanced targets. If only 0.2% of transactions are fraudulent, a model that always predicts "not fraud" achieves 99.8% accuracy while catching zero fraud. You need precision, recall, and F1 to understand WHERE your model fails.

⚠️ Target imbalance is the norm at AnyCompany. Fraud is rare (~0.1% of transactions). Attrition affects ~15% of employees. High-income earners are a small minority. Always look beyond accuracy: use metrics that account for the cost of different error types.

The Confusion Matrix

For binary classification, every prediction falls into one of four outcomes. For multiclass problems (n classes), the matrix becomes n×n.

                 | Predicted: Positive                                    | Predicted: Negative
Actual: Positive | True Positive (TP): 1620 - correctly identified fraud  | False Negative (FN): 128 - missed fraud (dangerous)
Actual: Negative | False Positive (FP): 199 - false alarm (annoying)      | True Negative (TN): 423 - correctly cleared

Key Metrics Explained

Metric | Formula | Value | What It Tells You | HCM Priority
Accuracy | (TP+TN) / Total | 86.2% | Overall correctness | Misleading when classes are imbalanced (fraud is rare)
Precision | TP / (TP+FP) | 89.0% | When the model says "fraud," how often is it right? | High precision = fewer false alarms for the compliance team
Recall | TP / (TP+FN) | 92.7% | Of all actual fraud, how much did we catch? | High recall = fewer missed fraud cases (critical for payroll)
F1 Score | 2*(P*R) / (P+R) | 90.8% | Harmonic mean of precision and recall | Balanced metric when both false alarms and misses matter

⚠️ At AnyCompany, recall matters most for fraud detection. A missed fraudulent transaction (FN) costs real money. A false alarm (FP) just triggers a review. Optimize for recall, accept slightly lower precision. For attrition prediction, precision matters more - you do not want to alarm managers about employees who are actually happy.
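
A minimal sketch that reproduces the metrics above from the raw confusion-matrix counts (scikit-learn; the counts are the ones shown in the matrix):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Rebuild label arrays from the confusion-matrix counts above.
tp, fn, fp, tn = 1620, 128, 199, 423
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

print(f"accuracy : {accuracy_score(y_true, y_pred):.3f}")   # ~0.862
print(f"precision: {precision_score(y_true, y_pred):.3f}")  # ~0.890
print(f"recall   : {recall_score(y_true, y_pred):.3f}")     # ~0.927
print(f"f1       : {f1_score(y_true, y_pred):.3f}")         # ~0.908
```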

📈 ROC Curves & Classification Thresholds

Classification models don't output a hard yes/no; they output a probability between 0.0 and 1.0. The threshold you choose converts that probability into a decision. ROC curves plot the True Positive Rate vs False Positive Rate across ALL possible thresholds, revealing the model's discrimination ability at every operating point.

💡 The "knee" of the ROC curve indicates the optimal threshold: the point where you get the most true positives with the fewest false positives. Steeper rise = better model. The area under the entire curve (AUC) captures this quality as a single number.
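
A minimal sketch of plotting an ROC curve and computing AUC (scikit-learn and matplotlib; the model and data are illustrative stand-ins):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]          # probabilities, not hard labels

fpr, tpr, thresholds = roc_curve(y_val, scores)    # one point per candidate threshold
print("AUC:", roc_auc_score(y_val, scores))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```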

Understanding AUC-ROC

🏆 AUC = 1.0 (Perfect)

Model perfectly separates all positives from negatives at every threshold. Unrealistic in practice but the theoretical ideal.

✅ AUC = 0.92 (Excellent)

Strong discrimination ability. At most thresholds, the model correctly ranks positives above negatives. Production-ready for most HCM use cases.

⚠️ AUC = 0.78 (Good)

Decent but room for improvement. May need better features or a more complex model. Acceptable for low-stakes predictions.

🎲 AUC = 0.5 (Random)

No better than flipping a coin. Model has learned nothing useful. Go back to feature engineering or try a different algorithm.

🎚️ Choosing the Right Threshold

Threshold | Effect | When to Use | AnyCompany Example
Low (0.2) | High recall, low precision (catch everything, many false alarms) | Cost of missing a positive is very high | Payroll fraud: flag anything suspicious, review manually
Medium (0.5) | Balanced precision and recall | Default starting point, equal cost of errors | Support ticket routing: balanced accuracy across categories
High (0.8) | High precision, low recall (only flag when very confident) | Cost of a false alarm is high | Attrition alerts to executives: only flag when very certain

Threshold Strategy at AnyCompany

Payroll Fraud Model: Threshold = 0.35 (aggressive). Better to flag 200 legitimate transactions for review than to miss one actual fraud case worth $50K+.

Attrition Alert Model: Threshold = 0.75 (conservative). Only alert managers when model is 75%+ confident. False alarms erode trust in the system.
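
A minimal sketch of applying a custom threshold instead of the 0.5 default (scikit-learn; the model and data are illustrative stand-ins, and the 0.35 value mirrors the payroll fraud example above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_val)[:, 1]

for threshold in (0.2, 0.35, 0.5, 0.8):
    preds = (scores >= threshold).astype(int)      # convert probability to decision
    print(f"threshold={threshold:.2f}  "
          f"precision={precision_score(y_val, preds, zero_division=0):.2f}  "
          f"recall={recall_score(y_val, preds):.2f}")
# Lower thresholds push recall up (catch more fraud) at the cost of precision.
```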

🚀 Reducing Training Time

Large models and massive datasets can take days to train on a single machine. ML models are becoming increasingly complex, and reducing training time supports faster iteration, lower costs, and quicker time-to-production. Two key techniques: early stopping (stop as soon as the model stops improving) and distributed training (use more machines).

Early Stopping

🛑 How It Works

Monitor the validation metric each epoch. If it stops improving for N consecutive epochs (the patience), halt training. Saves compute AND prevents overfitting: it acts as a regularization technique that preserves generalization ability (see the sketch after these cards).

📊 SageMaker Implementation

SageMaker evaluates the objective metric each epoch, calculates the median of running averages from previous jobs, and stops if current metric is worse than the median. Supported for XGBoost, Linear Learner, CatBoost, and custom algorithms that emit metrics.

⚙️ Configuration: Set TrainingJobEarlyStoppingType to AUTO via Boto3, or early_stopping_type='Auto' in the SageMaker Python SDK. Choose stopping criteria carefully: too aggressive and you underfit, too lenient and you waste compute.
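
A minimal, framework-agnostic sketch of the patience rule described in the How It Works card (train_one_epoch and evaluate are hypothetical callables you would supply):

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop training once the validation metric hasn't improved for `patience` epochs."""
    best_metric = float("-inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch()                      # hypothetical: one pass over the training data
        val_metric = evaluate()                # hypothetical: e.g. validation AUC

        if val_metric > best_metric:
            best_metric = val_metric
            epochs_without_improvement = 0     # any improvement resets the counter
        else:
            epochs_without_improvement += 1

        if epochs_without_improvement >= patience:
            print(f"Early stop at epoch {epoch}: best validation metric {best_metric:.4f}")
            break

    return best_metric
```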

🌐 Distributed Training

Split work across multiple instances to train faster. The approach depends on whether your bottleneck is data size (too many samples) or model size (too many parameters to fit in one GPU's memory).

Approach | How It Works | When to Use | AnyCompany Example
Data Parallelism | Same model replicated on each GPU; different data subsets per device; gradients averaged across instances after each batch | Large datasets, model fits in a single GPU's memory | Training an image classifier on millions of document scans (SMDDP library)
Model Parallelism | Model split across GPUs; each device holds part of the parameters; tensor parallelism splits large matrix multiplications | Model too large for a single GPU (billions of parameters) | Fine-tuning an LLM for AnyCompany Assist (SageMaker model parallelism library)
Hybrid Parallel | Both data and model split across instances: massive dataset distributed + model partitioned | Massive model + massive dataset | Training a foundation model on all AnyCompany support transcripts

💡 Decision rule: Start with data parallelism for large datasets. If you run out of GPU memory during training, switch to model parallelism. For most AnyCompany tabular models (XGBoost, Linear Learner), you don't need distributed training at all; a single ml.m5.xlarge trains in minutes.
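
A minimal sketch of what enabling the SageMaker distributed data parallel library (SMDDP) might look like for a PyTorch training job; the script name, IAM role, S3 path, and version strings are assumptions, so check the current SageMaker documentation for supported framework and instance combinations:

```python
from sagemaker.pytorch import PyTorch

# Hypothetical training script and role; the `distribution` block is the point here.
estimator = PyTorch(
    entry_point="train.py",                                 # assumed training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # assumed IAM role
    framework_version="2.0",                                # adjust to a supported version
    py_version="py310",
    instance_count=2,                                       # data parallelism across 2 instances
    instance_type="ml.p4d.24xlarge",                        # SMDDP requires supported GPU instances
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"train": "s3://any-company-ml/train/"})      # assumed S3 input channel
```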

🏗️ SageMaker Infrastructure for Large-Scale Training

🛡️ Resilient Environment

Self-healing clusters that automatically recover from hardware failures. Can reduce training time by up to 20% by avoiding full restarts on transient errors.

📚 Distributed Training Libraries

SMDDP (data parallelism) and SMP (model parallelism) libraries improve performance by up to 20%. Optimized AllReduce for gradient synchronization across instances.

🔧 SageMaker HyperPod

For the largest training jobs requiring full control over compute environment and workload scheduling. Optimized resource utilization for foundation model training.

🎛️ Hyperparameter Tuning Strategies

Hyperparameters are algorithm-specific knobs that control the training process. Applying different values to the same data yields different model variants. Finding optimal hyperparameters adjusts the bias-variance trade-off, protects against overfitting, and controls time and cost. Traditionally this tuning was done manually by domain experts; now it can be automated.

XGBoost Hyperparameter Categories

Lab 4 uses XGBoost. Its hyperparameters fall into three categories:

Category | Key Hyperparameters | What They Control
General | booster (gbtree, gblinear, dart) | Type of base algorithm added each iteration
Booster-Specific | num_round, max_depth, min_child_weight, alpha | Number of trees, tree complexity, regularization
Learning Task | objective, eta (learning rate), eval_metric | Loss function, step size, evaluation criteria
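
A minimal sketch of how the three categories show up in a native XGBoost parameter dictionary (the values are illustrative defaults, not Lab 4's tuned settings):

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=25, weights=[0.85], random_state=3)
dtrain = xgb.DMatrix(X, label=y)

params = {
    # General: which kind of base learner is added each boosting round
    "booster": "gbtree",
    # Booster-specific: tree complexity and regularization
    "max_depth": 6,
    "min_child_weight": 1,
    "alpha": 0.0,
    # Learning task: loss function, step size, and evaluation metric
    "objective": "binary:logistic",
    "eta": 0.1,
    "eval_metric": "auc",
}

# num_round (the number of trees) is passed separately to xgb.train.
model = xgb.train(params, dtrain, num_boost_round=100)
```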

Four Tuning Strategies

🔲 Grid Search

Try every combination in a predefined grid. Exhaustive but exponentially expensive. Only practical with few hyperparameters and small ranges. Guaranteed to find the best value in the grid.

๐ŸŽฒ

Random Search

Sample random combinations from the ranges. Surprisingly effective: it often finds good values faster than grid search because it explores more of the space. A good default choice (see the sketch after these cards). Risk: the best set could be missed.

🧠 Bayesian Optimization

Uses probabilistic models (Gaussian processes) to predict which configurations are likely to perform well. Learns from previous trials. Most sample-efficient for expensive training jobs. Converges quickly but complex to scale.

⚡ Hyperband

Two phases: Exploration (train many configs briefly, stop when improvement plateaus) → Exploitation (drop the worst 50%, allocate more epochs to the survivors). Repeats until a final set remains. Best for iterative algorithms (neural networks).
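
A minimal local sketch contrasting grid and random search over two XGBoost hyperparameters (scikit-learn's searchers with the xgboost sklearn wrapper; the ranges and data are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=25, weights=[0.85], random_state=4)
model = XGBClassifier()

# Grid search: every combination in a small, explicit grid (3 x 3 = 9 fits per CV fold).
grid = GridSearchCV(model, {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]},
                    scoring="roc_auc", cv=3).fit(X, y)

# Random search: 10 random draws from ranges covering roughly the same space.
rand = RandomizedSearchCV(model, {"max_depth": randint(3, 10),
                                  "learning_rate": uniform(0.01, 0.29)},
                          n_iter=10, scoring="roc_auc", cv=3, random_state=4).fit(X, y)

print("grid best  :", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```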

🔧 SageMaker Automatic Model Tuning (AMT)

AMT uses ML to find the best hyperparameters automatically. You specify ranges and an objective; AMT runs many training jobs and identifies the optimal configuration. Supports up to 30 hyperparameters with categorical, integer, or continuous ranges.

Configuration | What You Specify | Example (XGBoost Attrition)
Tuning Strategy | Bayesian, Random, Hyperband, or Grid | Bayesian (most efficient for expensive jobs)
Hyperparameter Ranges | Min/max for each hyperparameter (static or dynamic) | max_depth: [3, 10], eta: [0.01, 0.3], n_estimators: [100, 500]
Objective Metric | What to optimize (minimize or maximize) | Maximize validation:auc
Completion Criteria | When to stop: max jobs, max runtime, or target metric value | Max 20 jobs OR AUC reaches 0.93
Parallel Jobs | How many jobs to run simultaneously | 4 parallel (faster wall-clock time)
Early Stopping | Kill unpromising jobs before completion | AUTO (saves cost on bad configurations)
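
A minimal sketch of expressing this configuration with the SageMaker Python SDK; xgb_estimator, the S3 channels, and the exact ranges are assumptions mirroring the table above (the built-in XGBoost algorithm names the number of boosting rounds num_round rather than n_estimators):

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# `xgb_estimator` is an assumed, already-configured SageMaker XGBoost estimator.
tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 10),
        "eta": ContinuousParameter(0.01, 0.3),
        "num_round": IntegerParameter(100, 500),   # boosting rounds (n_estimators)
    },
    strategy="Bayesian",
    max_jobs=20,                     # completion criterion: at most 20 training jobs
    max_parallel_jobs=4,             # 4 jobs at a time for faster wall-clock time
    early_stopping_type="Auto",      # kill unpromising jobs early
)

# Assumed S3 locations for the prepared attrition data.
tuner.fit({"train": "s3://any-company-ml/attrition/train/",
           "validation": "s3://any-company-ml/attrition/validation/"})

print(tuner.best_training_job())     # name of the best-performing configuration
```
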
Lab 4 Architecture

Flow: SageMaker Studio notebook → Define objective metric + tuning strategy → AMT launches N training jobs using the XGBoost container → Each job outputs model artifacts to S3 + logs to CloudWatch → AMT identifies the best-performing configuration

Model: XGBoost attrition classifier

Strategy: Bayesian optimization (20 jobs, 4 parallel)

Result: Best AUC improved from 0.85 (default params) to 0.91 (tuned: max_depth=7, eta=0.05, n_estimators=340)

Cost: 20 training jobs × $2 each = $40 total. A 6-point AUC improvement (0.85 → 0.91) for $40 is excellent ROI.

🎯 Autotune: SageMaker also offers an autotune feature that finds optimal values without you manually specifying hyperparameter ranges, resources, or objective metrics. Useful when you're unsure which hyperparameters to tune or what ranges to use.

🎮 Metric Explorer

Explore how different evaluation metrics apply to AnyCompany use cases. Select a scenario to see which metrics matter most and why.

🛡️ Payroll Fraud Detection

Binary classification where missing fraud is catastrophic but false alarms are tolerable.

👤 Employee Attrition Alert

Binary classification where false alarms erode manager trust in the system.

💰 Salary Prediction

Regression where prediction error is measured in dollars - no confusion matrix needed.

🎫 Ticket Routing (Multi-class)

Multi-class classification where each misroute adds delay and frustration.

📋 Payroll Fraud Detection: Optimize for recall (catch all fraud). Use a low threshold (0.3). Accept more false positives - the compliance team can review flagged transactions. A single missed $50K fraud costs more than 100 false alarm reviews. Target: Recall > 95%, AUC > 0.92.

Aspect | Recommendation
Primary Metric | Recall (catch rate)
Secondary Metric | AUC-ROC (overall discrimination)
Threshold | Low (0.3) - flag aggressively
Acceptable FP Rate | Up to 10% (review cost is low)
Target Performance | Recall > 95%, Precision > 70%
Business Impact | Each missed fraud = $50K+ loss. Each false alarm = $5 review cost.

📝 Module Summary

After this module, you should be able to define and interpret evaluation metrics, explain techniques to reduce training time, and describe how hyperparameter tuning affects model performance.

✅ Bias-Variance

Low bias + low variance is ideal. Detect underfitting (low train + val) vs overfitting (high train, low val). Use validation sets and baselines to diagnose.

✅ Evaluation Metrics

Accuracy, precision, recall, F1, AUC-ROC. Choose based on business cost of different error types. Threshold converts probability to decision.

✅ Training Efficiency

Early stopping prevents overfitting and saves compute. Distributed training (data/model/hybrid parallel) scales to large datasets and models. SageMaker HyperPod for the largest jobs.

✅ Hyperparameter Tuning

Grid (exhaustive), Random (efficient), Bayesian (intelligent), Hyperband (iterative). SageMaker AMT automates the search with completion criteria and early stopping.

🧭 What's next: Module 8 covers model deployment strategies, including how to get your tuned model into production with real-time endpoints, batch transforms, and traffic-shifting strategies for zero-downtime updates.