Module 7 - Interactive Explainer

Evaluating & Tuning ML Models

Master evaluation metrics, understand bias-variance trade-offs, reduce training time with distributed computing, and optimize hyperparameters for production HCM models.

📝 Evaluation ⚡ Interactive 🏢 HCM Context 🧪 Lab 4

⚖️ Bias-Variance Trade-off

Every ML model has two sources of prediction error: bias (systematic inaccuracy) and variance (sensitivity to training data). The ideal model minimizes both. Understanding this trade-off is essential for diagnosing why a model underperforms and choosing the right fix.

🎯 Think of it like archery targets. Bias = how far your cluster of arrows is from the bullseye (systematic offset). Variance = how spread out your arrows are (inconsistency). The goal: a tight cluster centered on the bullseye (low bias + low variance).

The Four Quadrants

🎯 Low Bias + Low Variance (Ideal)

Accurate AND consistent. Predictions cluster tightly around the true value. The model has learned real patterns and generalizes well to unseen data.

📍 High Bias + Low Variance

Consistently wrong in the same direction. Underfitting. Like predicting every employee has 10% attrition risk regardless of their actual situation. Tightly clustered but off-target.

🎲 Low Bias + High Variance

Correct on average but wildly inconsistent. Overfitting. Model gives different predictions for similar employees depending on which training sample it saw. Centered but scattered.

💥 High Bias + High Variance

Wrong AND inconsistent. The worst case. Model is both too simple to capture patterns and too sensitive to noise. Start over with better features or algorithm.

🔍 Detection & Fixes

You detect bias and variance by comparing training set performance against validation set performance. The gap between them tells you which problem dominates.

Problem | How to Detect | Common Causes | Fixes
High Bias (Underfitting) | Low accuracy on BOTH train and validation sets | Model too simple, wrong features, insufficient training, inherited bias from dataset | Add features, use a more complex model, train longer, fix feature engineering
High Variance (Overfitting) | High train accuracy, LOW validation accuracy (big gap) | Model too complex, too many irrelevant features, trained too long on training data | Regularization (L1/L2), reduce features, early stopping, more diverse training data

AnyCompany Attrition Model Example

Underfitting (high bias): Using only tenure to predict attrition. Train accuracy: 62%, Val accuracy: 61%. Model is too simple; it needs compensation, performance, and management features.

Overfitting (high variance): Using 200 features including employee ID patterns. Train accuracy: 98%, Val accuracy: 72%. The 26% gap reveals memorization instead of learning.

Good fit: 25 curated features (tenure, salary percentile, performance, manager tenure). Train: 87%, Val: 85%. Small gap = good generalization to unseen employees.
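
A minimal sketch of this train/validation gap diagnosis in Python (scikit-learn, with a synthetic stand-in for the attrition data; the feature count mirrors the example above, and the diagnostic cut-offs are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an attrition dataset: 25 features, imbalanced binary target.
X, y = make_classification(n_samples=5000, n_features=25, weights=[0.85], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
gap = train_acc - val_acc
print(f"train={train_acc:.2f}  val={val_acc:.2f}  gap={gap:.2f}")

# Illustrative cut-offs only; pick values that make sense for your metric and domain.
if val_acc < 0.70 and gap < 0.05:
    print("Likely high bias (underfitting): add features or use a more complex model")
elif gap > 0.10:
    print("Likely high variance (overfitting): regularize, prune features, or stop earlier")
else:
    print("Reasonable fit: small train/validation gap and acceptable validation accuracy")
```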

📋 Building a Performance Baseline

Before tuning, establish a reference point. Three best practices from the course material:

📏 Establish Evaluation Metrics

Choose metrics aligned with business objectives BEFORE training. For fraud: recall. For attrition alerts: precision. For salary: RMSE. The metric defines what "good" means.

💻 Start with a Simple Model

Train a basic model first (linear regression, a simple decision tree). This baseline reveals the problem's complexity and sets a performance floor: if the simple model already scores 80%, any more complex model has to beat that (see the baseline sketch after these practices).

🌍 Use Real-World Data

Evaluation datasets must represent production data distribution. No leakage, no bias, no synthetic shortcuts. At AnyCompany: use recent employee data, not historical snapshots from a different era.
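
A minimal baseline sketch along these lines (scikit-learn; the majority-class dummy and default logistic regression are illustrative choices, not prescribed by the course):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ~15% attrition: the majority class dominates.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.85], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Floor 1: always predict the majority class ("nobody leaves").
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Floor 2: a simple linear model with default settings.
simple = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("majority-class baseline accuracy:", majority.score(X_val, y_val))
print("simple linear baseline accuracy: ", simple.score(X_val, y_val))
# Any tuned model should clearly beat both numbers on the metric chosen up front.
```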

📊 Classification Evaluation Metrics

For classification models (fraud/not fraud, stay/leave), accuracy alone is misleading, especially with imbalanced targets. If only 0.2% of transactions are fraudulent, a model that always predicts "not fraud" achieves 99.8% accuracy while catching zero fraud. You need precision, recall, and F1 to understand WHERE your model fails.

⚠️ Target imbalance is the norm at AnyCompany. Fraud is rare (~0.1% of transactions). Attrition affects ~15% of employees. High-income earners are a small minority. Always look beyond accuracy: use metrics that account for the cost of different error types.

The Confusion Matrix

For binary classification, every prediction falls into one of four outcomes. For multiclass problems (n classes), the matrix becomes n×n.

                 | Predicted: Positive                                    | Predicted: Negative
Actual: Positive | True Positive (TP): 1620 - correctly identified fraud  | False Negative (FN): 128 - missed fraud (dangerous)
Actual: Negative | False Positive (FP): 199 - false alarm (annoying)      | True Negative (TN): 423 - correctly cleared

Key Metrics Explained

Metric | Formula | Value | What It Tells You | HCM Priority
Accuracy | (TP+TN) / Total | 86.2% | Overall correctness | Misleading when classes are imbalanced (fraud is rare)
Precision | TP / (TP+FP) | 89.0% | When the model says "fraud," how often is it right? | High precision = fewer false alarms for the compliance team
Recall | TP / (TP+FN) | 92.7% | Of all actual fraud, how much did we catch? | High recall = fewer missed fraud cases (critical for payroll)
F1 Score | 2*(P*R) / (P+R) | 90.8% | Harmonic mean of precision and recall | Balanced metric when both false alarms and misses matter

⚠️ At AnyCompany, recall matters most for fraud detection. A missed fraudulent transaction (FN) costs real money. A false alarm (FP) just triggers a review. Optimize for recall, accept slightly lower precision. For attrition prediction, precision matters more - you do not want to alarm managers about employees who are actually happy.
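
A minimal sketch that reproduces the metrics above from the raw confusion-matrix counts (scikit-learn; the counts are the ones shown in the matrix):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Rebuild label arrays from the confusion-matrix counts above.
tp, fn, fp, tn = 1620, 128, 199, 423
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

print(f"accuracy : {accuracy_score(y_true, y_pred):.3f}")   # ~0.862
print(f"precision: {precision_score(y_true, y_pred):.3f}")  # ~0.890
print(f"recall   : {recall_score(y_true, y_pred):.3f}")     # ~0.927
print(f"f1       : {f1_score(y_true, y_pred):.3f}")         # ~0.908
```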

📈 ROC Curves & Classification Thresholds

Classification models don't output a hard yes/no; they output a probability between 0.0 and 1.0. The threshold you choose converts that probability into a decision. ROC curves plot the True Positive Rate vs False Positive Rate across ALL possible thresholds, revealing the model's discrimination ability at every operating point.

💡 The "knee" of the ROC curve indicates the optimal threshold: the point where you get the most true positives with the fewest false positives. Steeper rise = better model. The area under the entire curve (AUC) captures this quality as a single number.
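
A minimal sketch of plotting an ROC curve and computing AUC (scikit-learn and matplotlib; the model and data are illustrative stand-ins):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]          # probabilities, not hard labels

fpr, tpr, thresholds = roc_curve(y_val, scores)    # one point per candidate threshold
print("AUC:", roc_auc_score(y_val, scores))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```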

Understanding AUC-ROC

🏆 AUC = 1.0 (Perfect)

Model perfectly separates all positives from negatives at every threshold. Unrealistic in practice but the theoretical ideal.

✅ AUC = 0.92 (Excellent)

Strong discrimination ability. At most thresholds, the model correctly ranks positives above negatives. Production-ready for most HCM use cases.

⚠️ AUC = 0.78 (Good)

Decent but room for improvement. May need better features or a more complex model. Acceptable for low-stakes predictions.

🎲 AUC = 0.5 (Random)

No better than flipping a coin. Model has learned nothing useful. Go back to feature engineering or try a different algorithm.

🎚️ Choosing the Right Threshold

Threshold | Effect | When to Use | AnyCompany Example
Low (0.2) | High recall, low precision (catch everything, many false alarms) | Cost of missing a positive is very high | Payroll fraud: flag anything suspicious, review manually
Medium (0.5) | Balanced precision and recall | Default starting point, equal cost of errors | Support ticket routing: balanced accuracy across categories
High (0.8) | High precision, low recall (only flag when very confident) | Cost of a false alarm is high | Attrition alerts to executives: only flag when very certain

Threshold Strategy at AnyCompany

Payroll Fraud Model: Threshold = 0.35 (aggressive). Better to flag 200 legitimate transactions for review than to miss one actual fraud case worth $50K+.

Attrition Alert Model: Threshold = 0.75 (conservative). Only alert managers when model is 75%+ confident. False alarms erode trust in the system.
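
A minimal sketch of applying a custom threshold instead of the 0.5 default (scikit-learn; the model and data are illustrative stand-ins, and the 0.35 value mirrors the payroll fraud example above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_val)[:, 1]

for threshold in (0.2, 0.35, 0.5, 0.8):
    preds = (scores >= threshold).astype(int)      # convert probability to decision
    print(f"threshold={threshold:.2f}  "
          f"precision={precision_score(y_val, preds, zero_division=0):.2f}  "
          f"recall={recall_score(y_val, preds):.2f}")
# Lower thresholds push recall up (catch more fraud) at the cost of precision.
```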

🚀 Reducing Training Time

Large models and massive datasets can take days to train on a single machine. ML models are becoming increasingly complex, and reducing training time supports faster iteration, lower costs, and quicker time-to-production. Two key techniques: early stopping (stop as soon as the model stops improving) and distributed training (use more machines).

Early Stopping

🛑 How It Works

Monitor the validation metric each epoch. If it stops improving for N consecutive epochs (the patience), halt training. Saves compute AND prevents overfitting: it acts as a regularization technique that preserves generalization ability (see the sketch after these cards).

📊 SageMaker Implementation

SageMaker evaluates the objective metric each epoch, calculates the median of running averages from previous jobs, and stops if current metric is worse than the median. Supported for XGBoost, Linear Learner, CatBoost, and custom algorithms that emit metrics.

⚙️ Configuration: Set TrainingJobEarlyStoppingType to AUTO via Boto3, or early_stopping_type='Auto' in the SageMaker Python SDK. Choose stopping criteria carefully: too aggressive and you underfit, too lenient and you waste compute.
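
A minimal, framework-agnostic sketch of the patience rule described in the How It Works card (train_one_epoch and evaluate are hypothetical callables you would supply):

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop training once the validation metric hasn't improved for `patience` epochs."""
    best_metric = float("-inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch()                      # hypothetical: one pass over the training data
        val_metric = evaluate()                # hypothetical: e.g. validation AUC

        if val_metric > best_metric:
            best_metric = val_metric
            epochs_without_improvement = 0     # any improvement resets the counter
        else:
            epochs_without_improvement += 1

        if epochs_without_improvement >= patience:
            print(f"Early stop at epoch {epoch}: best validation metric {best_metric:.4f}")
            break

    return best_metric
```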

🌐 Distributed Training

Split work across multiple instances to train faster. The approach depends on whether your bottleneck is data size (too many samples) or model size (too many parameters to fit in one GPU's memory).

Approach | How It Works | When to Use | AnyCompany Example
Data Parallelism | Same model replicated on each GPU; different data subsets per device; gradients averaged across instances after each batch | Large datasets, model fits in a single GPU's memory | Training an image classifier on millions of document scans (SMDDP library)
Model Parallelism | Model split across GPUs; each device holds part of the parameters; tensor parallelism splits large matrix multiplications | Model too large for a single GPU (billions of parameters) | Fine-tuning an LLM for AnyCompany Assist (SageMaker model parallelism library)
Hybrid Parallel | Both data and model split across instances: massive dataset distributed + model partitioned | Massive model + massive dataset | Training a foundation model on all AnyCompany support transcripts

💡 Decision rule: Start with data parallelism for large datasets. If you run out of GPU memory during training, switch to model parallelism. For most AnyCompany tabular models (XGBoost, Linear Learner), you don't need distributed training at all; a single ml.m5.xlarge trains in minutes.
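
A minimal sketch of what enabling the SageMaker distributed data parallel library (SMDDP) might look like for a PyTorch training job; the script name, IAM role, S3 path, and version strings are assumptions, so check the current SageMaker documentation for supported framework and instance combinations:

```python
from sagemaker.pytorch import PyTorch

# Hypothetical training script and role; the `distribution` block is the point here.
estimator = PyTorch(
    entry_point="train.py",                                 # assumed training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # assumed IAM role
    framework_version="2.0",                                # adjust to a supported version
    py_version="py310",
    instance_count=2,                                       # data parallelism across 2 instances
    instance_type="ml.p4d.24xlarge",                        # SMDDP requires supported GPU instances
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"train": "s3://any-company-ml/train/"})      # assumed S3 input channel
```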

🏗️ SageMaker Infrastructure for Large-Scale Training

🛡️ Resilient Environment

Self-healing clusters that automatically recover from hardware failures. Can reduce training time by up to 20% by avoiding full restarts on transient errors.

📚 Distributed Training Libraries

SMDDP (data parallelism) and SMP (model parallelism) libraries improve performance by up to 20%. Optimized AllReduce for gradient synchronization across instances.

🔧 SageMaker HyperPod

For the largest training jobs requiring full control over compute environment and workload scheduling. Optimized resource utilization for foundation model training.

🎛️ Hyperparameter Tuning Strategies

Hyperparameters are algorithm-specific knobs that control the training process. Applying different values to the same data yields different model variants. Finding optimal hyperparameters adjusts the bias-variance trade-off, protects against overfitting, and controls time and cost. Traditionally this tuning was done manually by domain experts; now it can be automated.

XGBoost Hyperparameter Categories

Lab 4 uses XGBoost. Its hyperparameters fall into three categories:

Category | Key Hyperparameters | What They Control
General | booster (gbtree, gblinear, dart) | Type of base algorithm added each iteration
Booster-Specific | num_round, max_depth, min_child_weight, alpha | Number of trees, tree complexity, regularization
Learning Task | objective, eta (learning rate), eval_metric | Loss function, step size, evaluation criteria
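
A minimal sketch of how the three categories show up in a native XGBoost parameter dictionary (the values are illustrative defaults, not Lab 4's tuned settings):

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=25, weights=[0.85], random_state=3)
dtrain = xgb.DMatrix(X, label=y)

params = {
    # General: which kind of base learner is added each boosting round
    "booster": "gbtree",
    # Booster-specific: tree complexity and regularization
    "max_depth": 6,
    "min_child_weight": 1,
    "alpha": 0.0,
    # Learning task: loss function, step size, and evaluation metric
    "objective": "binary:logistic",
    "eta": 0.1,
    "eval_metric": "auc",
}

# num_round (the number of trees) is passed separately to xgb.train.
model = xgb.train(params, dtrain, num_boost_round=100)
```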

Four Tuning Strategies

🔲 Grid Search

Try every combination in a predefined grid. Exhaustive but exponentially expensive. Only practical with few hyperparameters and small ranges. Guaranteed to find the best value in the grid.

๐ŸŽฒ

Random Search

Sample random combinations from the ranges. Surprisingly effective: it often finds good values faster than grid search because it explores more of the space. A good default choice (see the sketch after these cards). Risk: the best set could be missed.

🧠 Bayesian Optimization

Uses probabilistic models (Gaussian processes) to predict which configurations are likely to perform well. Learns from previous trials. Most sample-efficient for expensive training jobs. Converges quickly but complex to scale.

⚡ Hyperband

Two phases: Exploration (train many configs briefly, stop when improvement plateaus) → Exploitation (drop the worst 50%, allocate more epochs to the survivors). Repeats until a final set remains. Best for iterative algorithms (neural networks).
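
A minimal local sketch contrasting grid and random search over two XGBoost hyperparameters (scikit-learn's searchers with the xgboost sklearn wrapper; the ranges and data are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=25, weights=[0.85], random_state=4)
model = XGBClassifier()

# Grid search: every combination in a small, explicit grid (3 x 3 = 9 fits per CV fold).
grid = GridSearchCV(model, {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]},
                    scoring="roc_auc", cv=3).fit(X, y)

# Random search: 10 random draws from ranges covering roughly the same space.
rand = RandomizedSearchCV(model, {"max_depth": randint(3, 10),
                                  "learning_rate": uniform(0.01, 0.29)},
                          n_iter=10, scoring="roc_auc", cv=3, random_state=4).fit(X, y)

print("grid best  :", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```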

🔧 SageMaker Automatic Model Tuning (AMT)

AMT uses ML to find the best hyperparameters automatically. You specify ranges and an objective; AMT runs many training jobs and identifies the optimal configuration. Supports up to 30 hyperparameters with categorical, integer, or continuous ranges.

Configuration | What You Specify | Example (XGBoost Attrition)
Tuning Strategy | Bayesian, Random, Hyperband, or Grid | Bayesian (most efficient for expensive jobs)
Hyperparameter Ranges | Min/max for each hyperparameter (static or dynamic) | max_depth: [3, 10], eta: [0.01, 0.3], n_estimators: [100, 500]
Objective Metric | What to optimize (minimize or maximize) | Maximize validation:auc
Completion Criteria | When to stop: max jobs, max runtime, or target metric value | Max 20 jobs OR AUC reaches 0.93
Parallel Jobs | How many jobs to run simultaneously | 4 parallel (faster wall-clock time)
Early Stopping | Kill unpromising jobs before completion | AUTO (saves cost on bad configurations)
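
A minimal sketch of expressing this configuration with the SageMaker Python SDK; xgb_estimator, the S3 channels, and the exact ranges are assumptions mirroring the table above (the built-in XGBoost algorithm names the number of boosting rounds num_round rather than n_estimators):

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# `xgb_estimator` is an assumed, already-configured SageMaker XGBoost estimator.
tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 10),
        "eta": ContinuousParameter(0.01, 0.3),
        "num_round": IntegerParameter(100, 500),   # boosting rounds (n_estimators)
    },
    strategy="Bayesian",
    max_jobs=20,                     # completion criterion: at most 20 training jobs
    max_parallel_jobs=4,             # 4 jobs at a time for faster wall-clock time
    early_stopping_type="Auto",      # kill unpromising jobs early
)

# Assumed S3 locations for the prepared attrition data.
tuner.fit({"train": "s3://any-company-ml/attrition/train/",
           "validation": "s3://any-company-ml/attrition/validation/"})

print(tuner.best_training_job())     # name of the best-performing configuration
```
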
Lab 4 Architecture

Flow: SageMaker Studio notebook → Define objective metric + tuning strategy → AMT launches N training jobs using the XGBoost container → Each job outputs model artifacts to S3 + logs to CloudWatch → AMT identifies the best-performing configuration

Model: XGBoost attrition classifier

Strategy: Bayesian optimization (20 jobs, 4 parallel)

Result: Best AUC improved from 0.85 (default params) to 0.91 (tuned: max_depth=7, eta=0.05, n_estimators=340)

Cost: 20 training jobs × $2 each = $40 total. A 6-point AUC improvement (0.85 → 0.91) for $40 is excellent ROI.

🎯 Autotune: SageMaker also offers an autotune feature that finds optimal values without you manually specifying hyperparameter ranges, resources, or objective metrics. Useful when you're unsure which hyperparameters to tune or what ranges to use.

🎮 Metric Explorer

Explore how different evaluation metrics apply to AnyCompany use cases. Select a scenario to see which metrics matter most and why.

🛡️ Payroll Fraud Detection

Binary classification where missing fraud is catastrophic but false alarms are tolerable.

👤 Employee Attrition Alert

Binary classification where false alarms erode manager trust in the system.

💰 Salary Prediction

Regression where prediction error is measured in dollars - no confusion matrix needed.

🎫 Ticket Routing (Multi-class)

Multi-class classification where each misroute adds delay and frustration.

📋 Payroll Fraud Detection: Optimize for recall (catch all fraud). Use a low threshold (0.3). Accept more false positives - the compliance team can review flagged transactions. A single missed $50K fraud costs more than 100 false alarm reviews. Target: Recall > 95%, AUC > 0.92.

Aspect | Recommendation
Primary Metric | Recall (catch rate)
Secondary Metric | AUC-ROC (overall discrimination)
Threshold | Low (0.3) - flag aggressively
Acceptable FP Rate | Up to 10% (review cost is low)
Target Performance | Recall > 95%, Precision > 70%
Business Impact | Each missed fraud = $50K+ loss. Each false alarm = $5 review cost.

📝 Module Summary

After this module, you should be able to define and interpret evaluation metrics, explain techniques to reduce training time, and describe how hyperparameter tuning affects model performance.

✅ Bias-Variance

Low bias + low variance is ideal. Detect underfitting (low train + val) vs overfitting (high train, low val). Use validation sets and baselines to diagnose.

✅ Evaluation Metrics

Accuracy, precision, recall, F1, AUC-ROC. Choose based on business cost of different error types. Threshold converts probability to decision.

✅ Training Efficiency

Early stopping prevents overfitting and saves compute. Distributed training (data/model/hybrid parallel) scales to large datasets and models. SageMaker HyperPod for the largest jobs.

✅ Hyperparameter Tuning

Grid (exhaustive), Random (efficient), Bayesian (intelligent), Hyperband (iterative). SageMaker AMT automates the search with completion criteria and early stopping.

🧭 What's next: Module 8 covers model deployment strategies, including how to get your tuned model into production with real-time endpoints, batch transforms, and traffic-shifting strategies for zero-downtime updates.