Lab 4 — Interactive Explainer
Configure hyperparameter ranges, launch an automatic tuning job with Bayesian optimization, evaluate 12 training experiments, and select the best model by AUC score.
In Lab 3 you trained a single XGBoost model with manually chosen hyperparameters. In this lab you let SageMaker Automatic Model Tuning (AMT) explore the hyperparameter space systematically — launching 12 training jobs across 5 hyperparameter ranges to find the combination that maximizes validation AUC.
Duration: ~20 minutes • Phase: Model Tuning • Prerequisite: Lab 3 (same train/validation CSVs in S3)
Pipeline at a glance: the same train (70%) and validation (10%) CSVs from Lab 3 feed 12 tuning experiments (4 running in parallel, Bayesian strategy), and the best model is selected by highest validation AUC, followed by correlation analysis of the results.
Key difference from Lab 3: Instead of one training job with fixed hyperparameters, you launch 12 jobs where SageMaker intelligently explores different configurations. Each job produces a model artifact — the tuner picks the winner.
Hyperparameters have a nonlinear relationship with model performance. Small changes can have outsized effects, and the optimal values depend on your specific data. Manual tuning is slow, biased toward your intuition, and doesn't scale.
- Manual search: trial-and-error. You pick values, train, check metrics, adjust. Slow and limited by human patience — typically 3–5 experiments.
- Grid search: try every combination in a predefined grid. Thorough but exponentially expensive — 4 values for each of 5 params is 4⁵ = 1,024 jobs.
- Bayesian optimization: uses results from completed jobs to predict which regions of the search space are most promising. Finds good solutions in far fewer experiments.
- HyperparameterTuner: the SageMaker SDK class that wraps an Estimator with search ranges, an objective metric, and a job budget (max_jobs, max_parallel_jobs).
- Parameter ranges: three types — ContinuousParameter (float), IntegerParameter (int), CategoricalParameter (enum). They define the search space boundaries.
- Objective metric: the single number the tuner optimizes. Here: validation:auc (Maximize). AUC measures ranking quality independent of threshold.
- Early stopping: automatically halts training jobs that aren't improving — saves compute cost without sacrificing the best result.
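The sketch below shows how these pieces fit together with the SageMaker Python SDK. It is a minimal example, not the lab's exact notebook code: xgb_estimator, s3_train_uri, and s3_validation_uri stand in for the estimator and S3 inputs you configured in Lab 3.

```python
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
)

# Search space boundaries -- one entry per tuned hyperparameter (ranges from this lab).
hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 2),
    "eta": ContinuousParameter(0, 1),
    "max_depth": IntegerParameter(1, 10),
    "min_child_weight": ContinuousParameter(1, 10),
    "num_round": IntegerParameter(100, 1000),
}

# xgb_estimator is the XGBoost Estimator configured in Lab 3 (placeholder name).
tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",  # the metric the tuner optimizes
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=12,                 # total experiment budget
    max_parallel_jobs=4,         # concurrent training jobs
    strategy="Bayesian",         # the default search strategy
    early_stopping_type="Auto",  # halt jobs that cannot beat the current best
)

# Launch the tuning job with the same channels as Lab 3 (placeholder URIs).
tuner.fit({"train": s3_train_uri, "validation": s3_validation_uri})
```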
The automatic tuning workflow orchestrates multiple training jobs, evaluates their results, and uses Bayesian optimization to decide what to try next. Click each node to explore what happens at each stage.
| Dimension | Lab 3 (Single Training) | Lab 4 (Automatic Tuning) |
|---|---|---|
| Training Jobs | 1 job | 12 jobs (budget) |
| Hyperparameters | Fixed values you chose | Ranges — SageMaker explores |
| Strategy | Manual guess | Bayesian optimization |
| Parallelism | 1 instance | 4 concurrent jobs |
| Early Stopping | Not used | Auto — stops unpromising jobs |
| Output | 1 model artifact | 12 artifacts — best selected |
| Cost | ~$0.25 (5 min × 1 instance) | ~$1.50 (12 jobs, some stopped early) |
| Objective | Observed logloss | Maximized AUC (0→1 scale) |
The tuning job explores 5 hyperparameters simultaneously. Each controls a different aspect of how XGBoost builds its ensemble of decision trees. Understanding what each one does helps you interpret the correlation plots at the end.
| Parameter | Type | Range | What It Controls | Too Low | Too High |
|---|---|---|---|---|---|
| alpha | Continuous | 0 – 2 | L1 regularization (sparsity). Pushes unimportant feature weights to exactly zero. | Overfitting (all features used) | Underfitting (too many features zeroed) |
| eta | Continuous | 0 – 1 | Learning rate. Shrinks each tree's contribution to make boosting more conservative. | Needs many rounds to converge | Overshoots — unstable training |
| max_depth | Integer | 1 – 10 | Maximum tree depth. Deeper trees capture more complex interactions. | Underfitting (too simple) | Overfitting (memorizes noise) |
| min_child_weight | Continuous | 1 – 10 | Minimum sum of instance weight in a leaf. Higher = more conservative splits. | Overfitting (tiny leaf groups) | Underfitting (ignores small patterns) |
| num_round | Integer | 100 – 1000 | Number of boosting rounds (trees). More rounds = more capacity. | Underfitting (not enough trees) | Overfitting + longer training time |
These parameters don't operate in isolation. The tuner adjusts all 5 simultaneously for each experiment, which is why no single correlation chart tells the whole story.
- eta ↔ num_round: a low learning rate needs more rounds to converge; a high eta with many rounds overshoots. The tuner finds the sweet spot between them.
- max_depth ↔ alpha: deep trees with low regularization overfit; shallow trees with high alpha underfit. The balance determines model complexity.
- min_child_weight ↔ max_depth: both control tree complexity from different angles. High min_child_weight with shallow depth = a very conservative model.
- num_round ↔ early stopping: even with a high num_round, early stopping halts training when validation AUC plateaus — preventing wasted compute on overfitting rounds.
SageMaker's default tuning strategy uses Bayesian optimization — a probabilistic approach that builds a surrogate model of the objective function and uses it to decide which hyperparameter combinations to try next.
The Bayesian optimizer must balance two competing goals:
- Exploration: try configurations in unexplored regions of the search space. Might find a surprisingly good area that initial random samples missed.
- Exploitation: focus on regions near the best results so far. Refine the known-good area to squeeze out the last bit of performance.
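To make the trade-off concrete, here is a toy sketch of the general Bayesian-optimization loop: a Gaussian-process surrogate models the objective, and an expected-improvement acquisition function picks the next trial, scoring high where the predicted AUC is good (exploitation) or the uncertainty is large (exploration). This illustrates the algorithm family, not SageMaker's internal implementation; the one-dimensional "AUC vs. eta" objective and all numbers are made up, and numpy/scipy/scikit-learn are assumed to be available.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(eta):
    """Stand-in for 'train a model with this eta and return validation AUC'."""
    return 0.85 - (eta - 0.3) ** 2  # made-up curve with a peak near eta = 0.3


candidates = np.linspace(0.01, 1.0, 200).reshape(-1, 1)  # search space for eta
X = np.array([[0.05], [0.50], [0.95]])                   # initial trials
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(9):  # 3 initial + 9 more = 12 "training jobs"
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement: large where the predicted mean beats the best so far
    # (exploitation) or where the surrogate is uncertain (exploration).
    best = y.max()
    z = (mu - best) / (sigma + 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    next_x = candidates[np.argmax(ei)]  # most promising configuration to try next
    X = np.vstack([X, next_x])
    y = np.append(y, objective(next_x[0]))

print(f"best eta = {X[np.argmax(y)][0]:.3f}, best 'AUC' = {y.max():.4f}")
```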
| Strategy | How It Works | Best For | Cost |
|---|---|---|---|
| Bayesian (this lab) | Builds probabilistic model, samples promising regions | Expensive training jobs, limited budget | Low (smart sampling) |
| Random | Uniformly samples from ranges — no learning between jobs | Quick baseline, embarrassingly parallel | Medium (no intelligence) |
| Grid | Exhaustively tries all combinations in a discrete grid | Small search spaces, need reproducibility | High (exponential growth) |
| Hyperband | Starts many jobs with small budgets, promotes best performers | Large search spaces, iterative algorithms | Low (early pruning) |
After all 12 jobs complete (or are stopped early), you analyze the results: which configuration won, how AUC improved over time, and which hyperparameters correlate most strongly with performance.
AUC (Area Under the ROC Curve) measures how well the model ranks positive examples above negative ones, independent of any classification threshold.
| AUC Score | Interpretation | Analogy |
|---|---|---|
| 0.50 | Random chance — no better than flipping a coin | Guessing who will leave the company by coin flip |
| 0.60–0.70 | Weak discrimination — some signal but noisy | Predicting attrition using only department |
| 0.70–0.85 | Good discrimination — useful in production | Predicting attrition with engagement + tenure + salary |
| 0.85–0.95 | Excellent — strong ranking ability | Fraud detection with behavioral + transactional features |
| 0.95–1.00 | Suspicious — check for target leakage | Using exit_interview_scheduled to predict attrition |
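One way to internalize what the number means: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so it depends only on ranking, never on a threshold. A tiny sketch with made-up scores (scikit-learn assumed available):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up scores for 4 leavers (label 1) and 4 stayers (label 0).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.4, 0.6, 0.3, 0.2, 0.1])

# 15 of the 16 (positive, negative) pairs are ranked correctly -> AUC = 0.9375.
print(roc_auc_score(y_true, scores))

# Rescaling the scores changes nothing: the ranking (and hence the AUC) is identical.
print(roc_auc_score(y_true, scores / 10))
```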
In the tuner configuration, objective_metric_name='validation:auc' with objective_type='Maximize' tells SageMaker to prefer jobs with higher AUC values.
After tuning, the lab plots each hyperparameter against the final AUC score with a line of best fit. Here's how to read them:
- Positive slope: higher values of this parameter tend to produce better AUC. Consider expanding the upper range in a follow-up tuning job.
- Negative slope: lower values work better. The upper end of your range may be causing overfitting or instability.
- Flat line: this parameter has little effect on AUC in the tested range. You could fix it and tune other parameters instead.
- Scattered points: high variance means this parameter interacts strongly with others. Its effect depends on the combination, not its value alone.
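You can rebuild a simplified version of these plots yourself from the tuning job's results. The sketch below (placeholder job name; pandas, matplotlib, and the SageMaker SDK assumed available) pulls one row per training job and scatters each hyperparameter against the final AUC, without the fitted trend line:

```python
import matplotlib.pyplot as plt
from sagemaker.analytics import HyperparameterTuningJobAnalytics

# One row per training job: hyperparameter values plus FinalObjectiveValue (AUC).
analytics = HyperparameterTuningJobAnalytics("your-tuning-job-name")  # placeholder
df = analytics.dataframe()

params = ["alpha", "eta", "max_depth", "min_child_weight", "num_round"]
fig, axes = plt.subplots(1, len(params), figsize=(20, 3), sharey=True)
for ax, p in zip(axes, params):
    ax.scatter(df[p].astype(float), df["FinalObjectiveValue"])
    ax.set_xlabel(p)
axes[0].set_ylabel("validation:auc")
plt.show()
```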
Some training jobs in the tuning run will show status "Stopped" rather than "Completed". This is normal and expected.
With early_stopping_type='Auto', SageMaker monitors the objective metric across all running jobs. If a job's intermediate results indicate it cannot beat the current best, it's terminated early — saving compute without losing the best model. A stopped job is not a failed job.
The tuner automatically identifies the best training job via describe_hyper_parameter_tuning_job. The response includes:
| Field | What It Contains |
|---|---|
| BestTrainingJob.TrainingJobName | The job that achieved the highest validation AUC |
| BestTrainingJob.FinalHyperParameterTuningJobObjectiveMetric.Value | The winning AUC score (e.g., 0.9234) |
| BestTrainingJob.TunedHyperParameters | The exact alpha, eta, max_depth, min_child_weight, num_round values |
| TrainingJobStatusCounters | How many jobs Completed vs Stopped vs Failed |
The best model's artifact (model.tar.gz) is stored in S3 at the output path you configured. This is the model you'd deploy to an endpoint in Lab 5.
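A short sketch of pulling the winner with boto3 (the tuning job name is a placeholder; field names follow the describe_hyper_parameter_tuning_job response above):

```python
import boto3

sm = boto3.client("sagemaker")

desc = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="your-tuning-job-name"  # placeholder
)

best = desc["BestTrainingJob"]
print(best["TrainingJobName"])                                       # winning job
print(best["FinalHyperParameterTuningJobObjectiveMetric"]["Value"])  # winning AUC
print(best["TunedHyperParameters"])       # e.g. {'alpha': '0.12', 'eta': '0.31', ...}
print(desc["TrainingJobStatusCounters"])  # Completed vs Stopped vs error counts

# Locate the best model artifact (model.tar.gz) for deployment in Lab 5.
job = sm.describe_training_job(TrainingJobName=best["TrainingJobName"])
print(job["ModelArtifacts"]["S3ModelArtifacts"])
```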
How does hyperparameter tuning apply to real workforce analytics? Let's map the lab concepts to AnyCompany's ML products.
- Fraud detection: maximize recall — missing a fraudulent transaction costs $50K+. Tune aggressively for sensitivity.
- Attrition prediction: balance precision and recall — false alarms erode manager trust, but missed departures disrupt teams.
- Salary prediction: minimize RMSE — predictions need to be within $5K of market rate to be actionable for compensation teams.
- HR ticket classification: maximize macro-F1 — equal performance across all ticket categories (payroll, benefits, time-off, tax).
Mapping the lab's adult income dataset to AnyCompany's attrition prediction problem:
| Lab 4 Concept | AnyCompany Equivalent | Why It Matters |
|---|---|---|
| Binary target (income >50K) | Binary target (left_company = 1/0) | Same XGBoost binary:logistic objective |
| AUC as objective metric | AUC for ranking flight-risk employees | HR wants a ranked list, not just yes/no |
| 12 tuning jobs | 50–100 jobs (larger budget for production) | More features + more data justifies larger search |
| 5 hyperparameters tuned | Same 5 + possibly subsample, colsample_bytree | Additional regularization for high-dimensional HR data |
| Early stopping on AUC plateau | Early stopping + warm start from previous quarter's model (sketched below) | Quarterly retraining reuses prior knowledge |
| Correlation charts | Feature importance + SHAP for explainability | HR needs to explain why someone is flagged as flight-risk |
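A hedged sketch of what that warm start could look like with the SDK's WarmStartConfig. The parent tuning job name is hypothetical, and xgb_estimator / hyperparameter_ranges refer to objects like those in the earlier sketch:

```python
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

# Reuse results from last quarter's tuning job instead of starting from scratch.
warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.TRANSFER_LEARNING,
    parents={"attrition-tuning-previous-quarter"},  # hypothetical parent job name
)

tuner = HyperparameterTuner(
    estimator=xgb_estimator,                  # this quarter's estimator (placeholder)
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=50,
    max_parallel_jobs=5,
    early_stopping_type="Auto",
    warm_start_config=warm_start,
)
```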
Cost control: start with max_jobs=20, early_stopping_type='Auto', and managed Spot instances (up to 90% savings). A typical tuning job for attrition costs ~$3–5 with Spot — manageable even at scale.
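The Spot settings attach to the underlying estimator, not the tuner. A minimal sketch (region, container version, role, and bucket are placeholders, not the lab's actual values):

```python
import sagemaker
from sagemaker.estimator import Estimator

xgb_estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", "us-east-1", version="1.5-1"),
    role=role,                     # placeholder IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://your-bucket/attrition/output",  # placeholder
    use_spot_instances=True,       # managed Spot: up to ~90% cheaper than on-demand
    max_run=1800,                  # cap each training job at 30 minutes
    max_wait=3600,                 # how long to wait for Spot capacity (>= max_run)
)
```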