Lab 4 — Interactive Explainer

Hyperparameter Tuning with SageMaker

Configure hyperparameter ranges, launch an automatic tuning job with Bayesian optimization, evaluate 12 training experiments, and select the best model by AUC score.

🎯 Automatic Tuning ⚙️ Bayesian Optimization 📊 AUC Metric 🏢 HCM Context 🧪 Lab 4

📋 Lab 4 Overview

In Lab 3 you trained a single XGBoost model with manually chosen hyperparameters. In this lab you let SageMaker Automatic Model Tuning (AMT) explore the hyperparameter space systematically — launching 12 training jobs across 5 hyperparameter ranges to find the combination that maximizes validation AUC.

Duration: ~20 minutes • Phase: Model Tuning • Prerequisite: Lab 3 (same train/validation CSVs in S3)

What You Build

Input

📁 S3 Data

Train CSV (70%)
Validation CSV (10%)
Same data as Lab 3

Process

🎯 Tuning Job

12 experiments
4 parallel • Bayesian

Output

🏆 Best Model

Highest AUC
+ correlation analysis

Key difference from Lab 3: Instead of one training job with fixed hyperparameters, you launch 12 jobs where SageMaker intelligently explores different configurations. Each job produces a model artifact — the tuner picks the winner.
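A minimal sketch of the two S3 input channels, assuming the Lab 3 CSVs are already uploaded; the bucket and prefix names are placeholders, not the lab's exact paths:

```python
# Sketch: reuse the Lab 3 train/validation channels. Bucket and prefix are placeholders.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    "s3://<bucket>/<prefix>/train/", content_type="text/csv"
)
validation_input = TrainingInput(
    "s3://<bucket>/<prefix>/validation/", content_type="text/csv"
)
```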

Why Not Just Guess?

Hyperparameters have a nonlinear relationship with model performance. Small changes can have outsized effects, and the optimal values depend on your specific data. Manual tuning is slow, biased toward your intuition, and doesn't scale.

🎲

Manual Tuning

Trial-and-error. You pick values, train, check metrics, adjust. Slow and limited by human patience — typically 3–5 experiments.

📊

Grid Search

Try every combination in a predefined grid. Thorough but exponentially expensive — 4 candidate values for each of 5 params means 4⁵ = 1,024 jobs.

🧠

Bayesian (This Lab)

Uses results from completed jobs to predict which regions of the search space are most promising. Finds good solutions in far fewer experiments.

Key Concepts Covered

🔍

HyperparameterTuner

SageMaker SDK class that wraps an Estimator with search ranges, objective metric, and job budget (max_jobs, max_parallel_jobs).

📏

Parameter Ranges

Three types: ContinuousParameter (float), IntegerParameter (int), CategoricalParameter (enum). Define the search space boundaries.

🎯

Objective Metric

The single number the tuner optimizes. Here: validation:auc (Maximize). AUC measures ranking quality independent of threshold.

⏱️

Early Stopping

Automatically halts training jobs that aren't improving — saves compute cost without sacrificing the best result.

🔄 Tuning Pipeline

The automatic tuning workflow orchestrates multiple training jobs, evaluates their results, and uses Bayesian optimization to decide what to try next.

Pipeline stages: ⚙️ Configure (ranges + objective) → 🚀 Launch Jobs (4 parallel × 3 waves) → 🧠 Bayesian Update (learn from results) → 📊 Evaluate (compare AUC scores) → 🏆 Best Model (deploy candidate)

Lab 3 vs Lab 4 — Side by Side

| Dimension | Lab 3 (Single Training) | Lab 4 (Automatic Tuning) |
| --- | --- | --- |
| Training Jobs | 1 job | 12 jobs (budget) |
| Hyperparameters | Fixed values you chose | Ranges — SageMaker explores |
| Strategy | Manual guess | Bayesian optimization |
| Parallelism | 1 instance | 4 concurrent jobs |
| Early Stopping | Not used | Auto — stops unpromising jobs |
| Output | 1 model artifact | 12 artifacts — best selected |
| Cost | ~$0.25 (5 min × 1 instance) | ~$1.50 (12 jobs, some stopped early) |
| Objective | Observed logloss | Maximized AUC (0→1 scale) |

⚙️ XGBoost Hyperparameters

The tuning job explores 5 hyperparameters simultaneously. Each controls a different aspect of how XGBoost builds its ensemble of decision trees. Understanding what each one does helps you interpret the correlation plots at the end.

The 5 Tuned Parameters

| Parameter | Type | Range | What It Controls | Too Low | Too High |
| --- | --- | --- | --- | --- | --- |
| alpha | Continuous | 0 – 2 | L1 regularization (sparsity). Pushes unimportant feature weights to exactly zero. | Overfitting (all features used) | Underfitting (too many features zeroed) |
| eta | Continuous | 0 – 1 | Learning rate. Shrinks each tree's contribution to make boosting more conservative. | Needs many rounds to converge | Overshoots — unstable training |
| max_depth | Integer | 1 – 10 | Maximum tree depth. Deeper trees capture more complex interactions. | Underfitting (too simple) | Overfitting (memorizes noise) |
| min_child_weight | Continuous | 1 – 10 | Minimum sum of instance weight in a leaf. Higher = more conservative splits. | Overfitting (tiny leaf groups) | Underfitting (ignores small patterns) |
| num_round | Integer | 100 – 1000 | Number of boosting rounds (trees). More rounds = more capacity. | Underfitting (not enough trees) | Overfitting + longer training time |
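In the SageMaker Python SDK, these five ranges might be declared roughly like this; the variable name is illustrative and the bounds mirror the table above, so treat it as a sketch rather than the lab's verbatim notebook code:

```python
# Sketch: the five search ranges from the table, expressed with SDK parameter types.
from sagemaker.tuner import ContinuousParameter, IntegerParameter

hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 2),              # L1 regularization strength
    "eta": ContinuousParameter(0, 1),                # learning rate
    "max_depth": IntegerParameter(1, 10),            # maximum tree depth
    "min_child_weight": ContinuousParameter(1, 10),  # minimum leaf weight
    "num_round": IntegerParameter(100, 1000),        # number of boosting rounds
}
```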

How Parameters Interact

These parameters don't operate in isolation. The tuner adjusts all 5 simultaneously in each experiment, which is why no single correlation chart tells the whole story.

💡

eta × num_round

Low learning rate needs more rounds to converge. High eta with many rounds overshoots. The tuner finds the sweet spot between them.

🌳

max_depth × alpha

Deep trees with low regularization overfit. Shallow trees with high alpha underfit. The balance determines model complexity.

⚖️

min_child_weight × max_depth

Both control tree complexity from different angles. High min_child_weight with shallow depth = very conservative model.

🎯

Early Stopping Effect

Even with high num_round, early stopping halts training when validation AUC plateaus — preventing wasted compute on overfitting rounds.

Parameter Types in SageMaker SDK

📝 ContinuousParameter(min, max) — Floats sampled from the range. Used for alpha, eta, min_child_weight where fractional values matter.
IntegerParameter(min, max) — Whole numbers only. Used for max_depth and num_round where fractions don't make sense.
CategoricalParameter([values]) — Picks from a fixed list. Not used in this lab, but useful for choosing between algorithms or objective functions.
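Putting the pieces together, a hedged sketch of the full tuning launch. The XGBoost image version, region, instance type, and S3 output path are assumptions carried over from a Lab 3-style setup, not the lab's exact values:

```python
# Sketch: wrap a Lab 3-style XGBoost Estimator in a HyperparameterTuner and launch
# the 12-job search. Region, version, bucket, and instance type are illustrative.
from sagemaker import get_execution_role, image_uris
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner

xgb_image = image_uris.retrieve("xgboost", region="us-east-1", version="1.7-1")

xgb_estimator = Estimator(
    image_uri=xgb_image,
    role=get_execution_role(),                # notebook's SageMaker execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/<prefix>/output/",
)
xgb_estimator.set_hyperparameters(objective="binary:logistic", eval_metric="auc")

tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",       # the single number to optimize
    objective_type="Maximize",                    # AUC: higher is better
    hyperparameter_ranges=hyperparameter_ranges,  # the five ranges sketched above
    max_jobs=12,                                  # total experiment budget
    max_parallel_jobs=4,                          # 4 concurrent jobs -> 3 waves
    strategy="Bayesian",                          # the default strategy
    early_stopping_type="Auto",                   # halt unpromising jobs early
)

# Same train/validation channels as Lab 3 (sketched in the Overview section).
tuner.fit({"train": train_input, "validation": validation_input})
```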

🧠 Bayesian Optimization Strategy

SageMaker's default tuning strategy uses Bayesian optimization — a probabilistic approach that builds a surrogate model of the objective function and uses it to decide which hyperparameter combinations to try next.

How Bayesian Tuning Works

Optimization loop (repeated until the 12-job budget is exhausted): 🎲 Sample (pick the next config) → 🏃 Train (run a training job) → 📏 Measure (record the AUC score) → 📈 Update (refine the surrogate model) → back to Sample.

Exploration vs Exploitation

The Bayesian optimizer must balance two competing goals:

🌍

Exploration

Try configurations in unexplored regions of the search space. Might find a surprisingly good area that initial random samples missed.

⛏️

Exploitation

Focus on regions near the best results so far. Refine the known-good area to squeeze out the last bit of performance.

💡 Why progress isn't steady: You'll see the AUC-over-time chart jump around rather than smoothly increasing. That's the optimizer deliberately exploring new regions (which might score worse) to avoid getting stuck in a local optimum. With only 12 jobs, some exploration is expected.
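To make the loop concrete, here is a deliberately tiny, self-contained toy: one hyperparameter (eta), a fake training function standing in for a real job, and an upper-confidence-bound rule balancing exploration and exploitation. It illustrates the idea only; it is not SageMaker's actual implementation:

```python
# Toy illustration of the sample -> train -> measure -> update loop. NOT SageMaker's
# implementation: one fake hyperparameter, a made-up scoring function, 12 "jobs".
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_training_job(eta):
    # Stand-in for a real training job: returns a fake validation AUC.
    return 0.90 - (eta - 0.3) ** 2 + np.random.normal(0, 0.005)

rng = np.random.default_rng(42)
candidates = np.linspace(0.01, 1.0, 200).reshape(-1, 1)
tried, scores = [], []

# Seed with two random configs, then let the surrogate model choose the rest.
for _ in range(2):
    eta = rng.uniform(0.01, 1.0)
    tried.append([eta]); scores.append(run_training_job(eta))

for _ in range(10):
    surrogate = GaussianProcessRegressor().fit(np.array(tried), np.array(scores))
    mean, std = surrogate.predict(candidates, return_std=True)
    ucb = mean + 1.5 * std          # high mean = exploit, high uncertainty = explore
    eta = float(candidates[np.argmax(ucb)][0])
    tried.append([eta]); scores.append(run_training_job(eta))

best = int(np.argmax(scores))
print(f"best eta ~ {tried[best][0]:.3f}, fake AUC ~ {scores[best]:.4f}")
```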

Tuning Strategies Compared

| Strategy | How It Works | Best For | Cost |
| --- | --- | --- | --- |
| Bayesian (this lab) | Builds probabilistic model, samples promising regions | Expensive training jobs, limited budget | Low (smart sampling) |
| Random | Uniformly samples from ranges — no learning between jobs | Quick baseline, embarrassingly parallel | Medium (no intelligence) |
| Grid | Exhaustively tries all combinations in a discrete grid | Small search spaces, need reproducibility | High (exponential growth) |
| Hyperband | Starts many jobs with small budgets, promotes best performers | Large search spaces, iterative algorithms | Low (early pruning) |
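In the SDK the strategy is a single constructor argument. A sketch of swapping in Random search, reusing the estimator and ranges sketched earlier:

```python
# Sketch: same search space, different strategy. Random search has no learning step
# between jobs, so it can run fully in parallel.
random_tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=12,
    max_parallel_jobs=12,     # embarrassingly parallel
    strategy="Random",        # alternatives: "Bayesian" (default), "Hyperband", "Grid"
)
```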

📊 Evaluating Tuning Results

After all 12 jobs complete (or are stopped early), you analyze the results: which configuration won, how AUC improved over time, and which hyperparameters correlate most strongly with performance.

Understanding AUC as Objective Metric

AUC (Area Under the ROC Curve) measures how well the model ranks positive examples above negative ones, independent of any classification threshold.

| AUC Score | Interpretation | Analogy |
| --- | --- | --- |
| 0.50 | Random chance — no better than flipping a coin | Guessing who will leave the company by coin flip |
| 0.60–0.70 | Weak discrimination — some signal but noisy | Predicting attrition using only department |
| 0.70–0.85 | Good discrimination — useful in production | Predicting attrition with engagement + tenure + salary |
| 0.85–0.95 | Excellent — strong ranking ability | Fraud detection with behavioral + transactional features |
| 0.95–1.00 | Suspicious — check for target leakage | Using exit_interview_scheduled to predict attrition |
⚠️ Why Maximize, not Minimize? Unlike logloss (Lab 3's implicit metric where lower = better), AUC is a "higher is better" metric. The tuner's objective_type='Maximize' tells it to prefer jobs with higher AUC values.
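To see what the number means in practice, here is AUC computed on a tiny made-up example; the labels and scores are illustrative, not from the lab's dataset:

```python
# Sketch: AUC only cares about ranking -- do positives get higher scores than negatives?
from sklearn.metrics import roc_auc_score

y_val = [0, 0, 1, 1, 0, 1]                    # made-up true labels (1 = positive class)
val_probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # made-up predicted probabilities

print(f"validation AUC: {roc_auc_score(y_val, val_probs):.4f}")   # ~0.89 here
```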

What the Correlation Charts Tell You

After tuning, the lab plots each hyperparameter against the final AUC score with a line of best fit. Here's how to read them:

↗️

Positive Slope

Higher values of this parameter tend to produce better AUC. Consider expanding the upper range in a follow-up tuning job.

↘️

Negative Slope

Lower values work better. The upper end of your range may be causing overfitting or instability.

↔️

Flat Line

This parameter has little effect on AUC in the tested range. You could fix it and tune other parameters instead.

💠

Scattered Points

High variance means this parameter interacts strongly with others. Its effect depends on the combination, not its value alone.
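A sketch of how you might pull the per-job results into a DataFrame and draw these plots yourself, using the tuner object from the launch sketch earlier:

```python
# Sketch: per-job results (tuned parameters + final AUC) as a pandas DataFrame,
# then one scatter plot per hyperparameter. tuner is the HyperparameterTuner above.
import matplotlib.pyplot as plt

df = tuner.analytics().dataframe()   # one row per training job

for param in ["alpha", "eta", "max_depth", "min_child_weight", "num_round"]:
    plt.figure()
    plt.scatter(df[param].astype(float), df["FinalObjectiveValue"])
    plt.xlabel(param)
    plt.ylabel("validation:auc")
    plt.title(f"{param} vs AUC")
plt.show()
```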

Early Stopping Behavior

Some training jobs in the tuning run will show status "Stopped" rather than "Completed". This is normal and expected.

🛑 How it works: When early_stopping_type='Auto', SageMaker monitors the objective metric across all running jobs. If a job's intermediate results indicate it cannot beat the current best, it's terminated early — saving compute without losing the best model. A stopped job is not a failed job.

Selecting the Best Model

The tuner automatically identifies the best training job via describe_hyper_parameter_tuning_job. The response includes:

| Field | What It Contains |
| --- | --- |
| BestTrainingJob.TrainingJobName | The job that achieved the highest validation AUC |
| BestTrainingJob.FinalHyperParameterTuningJobObjectiveMetric.Value | The winning AUC score (e.g., 0.9234) |
| BestTrainingJob.TunedHyperParameters | The exact alpha, eta, max_depth, min_child_weight, num_round values |
| TrainingJobStatusCounters | How many jobs Completed vs Stopped vs Failed |

The best model's artifact (model.tar.gz) is stored in S3 at the output path you configured. This is the model you'd deploy to an endpoint in Lab 5.
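A sketch of reading those fields with boto3; taking the tuning job name from the tuner object launched earlier is an assumption about how your notebook is structured:

```python
# Sketch: inspect the winning job and the job status counters via the describe API.
import boto3

sm = boto3.client("sagemaker")
resp = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.name
)

best = resp["BestTrainingJob"]
print(best["TrainingJobName"])                                       # winning job
print(best["FinalHyperParameterTuningJobObjectiveMetric"]["Value"])  # best AUC
print(best["TunedHyperParameters"])                                  # winning params
print(resp["TrainingJobStatusCounters"])                             # Completed / Stopped / Failed
```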

🏢 HCM Mapping — AnyCompany Context

How does hyperparameter tuning apply to real workforce analytics? Let's map the lab concepts to AnyCompany's ML products.

Tuning Scenarios at AnyCompany

🏢 Four scenarios illustrate how automatic tuning applies to different AnyCompany ML products:
🚨

Payroll Fraud Detection

Maximize recall — missing a fraudulent transaction costs $50K+. Tune aggressively for sensitivity.

📉

Employee Attrition

Balance precision and recall — false alarms erode manager trust, but missed departures disrupt teams.

💰

Salary Benchmarking

Minimize RMSE — predictions need to be within $5K of market rate to be actionable for compensation teams.

🤖

AnyCompany Assist Routing

Maximize macro-F1 — equal performance across all ticket categories (payroll, benefits, time-off, tax).


Lab 4 → AnyCompany Attrition Model

Mapping the lab's adult income dataset to AnyCompany's attrition prediction problem:

| Lab 4 Concept | AnyCompany Equivalent | Why It Matters |
| --- | --- | --- |
| Binary target (income >50K) | Binary target (left_company = 1/0) | Same XGBoost binary:logistic objective |
| AUC as objective metric | AUC for ranking flight-risk employees | HR wants a ranked list, not just yes/no |
| 12 tuning jobs | 50–100 jobs (larger budget for production) | More features + more data justifies a larger search |
| 5 hyperparameters tuned | Same 5 + possibly subsample, colsample_bytree | Additional regularization for high-dimensional HR data |
| Early stopping on AUC plateau | Early stopping + warm start from previous quarter's model | Quarterly retraining reuses prior knowledge |
| Correlation charts | Feature importance + SHAP for explainability | HR needs to explain why someone is flagged as flight-risk |

Production Tuning at Scale

💡 Multi-model tuning: AnyCompany serves thousands of clients. Each client's workforce has different patterns. In production, you might run separate tuning jobs per client segment (enterprise vs SMB, US vs India, tech vs manufacturing) — each finding its own optimal hyperparameters. SageMaker Pipelines can orchestrate this at scale with a tuning step per segment.
💸 Cost control: With 1000+ client models to tune quarterly, cost matters. Use max_jobs=20 with early_stopping_type='Auto' and Spot instances (up to 90% savings). A typical tuning job for attrition costs ~$3–5 with Spot — manageable even at scale.
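A hedged sketch of what that production setup could look like: warm start from the previous quarter's tuning job plus Managed Spot training. The previous job name is a placeholder, and the estimator and ranges are the ones sketched earlier:

```python
# Sketch: quarterly retune with Managed Spot training and a warm start from last
# quarter's tuning job. previous_tuning_job_name is an illustrative placeholder.
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

xgb_estimator.use_spot_instances = True   # Spot capacity: up to ~90% cheaper
xgb_estimator.max_run = 3600              # cap each training job at 1 hour
xgb_estimator.max_wait = 7200             # wait up to 2 hours for Spot capacity

warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={previous_tuning_job_name},   # reuse knowledge from last quarter's job
)

quarterly_tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=4,
    early_stopping_type="Auto",
    warm_start_config=warm_start,
)
```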