Lab 4 — Interactive Explainer
Configure hyperparameter ranges, launch an automatic tuning job with Bayesian optimization, evaluate 12 training experiments, and select the best model by AUC score.
In Lab 3 you trained a single XGBoost model with manually chosen hyperparameters. In this lab you let SageMaker Automatic Model Tuning (AMT) explore the hyperparameter space systematically — launching 12 training jobs across 5 hyperparameter ranges to find the combination that maximizes validation AUC.
Duration: ~20 minutes • Phase: Model Tuning • Prerequisite: Lab 3 (same train/validation CSVs in S3)
Pipeline at a glance: the same train (70%) and validation (10%) CSVs from Lab 3 feed 12 tuning experiments (4 running in parallel, Bayesian strategy), and the best model is selected by highest validation AUC, followed by correlation analysis of the results.
Key difference from Lab 3: Instead of one training job with fixed hyperparameters, you launch 12 jobs where SageMaker intelligently explores different configurations. Each job produces a model artifact — the tuner picks the winner.
Hyperparameters have a nonlinear relationship with model performance. Small changes can have outsized effects, and the optimal values depend on your specific data. Manual tuning is slow, biased toward your intuition, and doesn't scale.
- Manual search: trial-and-error. You pick values, train, check metrics, adjust. Slow and limited by human patience — typically 3–5 experiments.
- Grid search: try every combination in a predefined grid. Thorough but exponentially expensive — 4 values for each of 5 params is 4⁵ = 1,024 jobs.
- Bayesian optimization: uses results from completed jobs to predict which regions of the search space are most promising. Finds good solutions in far fewer experiments.
- HyperparameterTuner: the SageMaker SDK class that wraps an Estimator with search ranges, an objective metric, and a job budget (max_jobs, max_parallel_jobs).
- Parameter ranges: three types — ContinuousParameter (float), IntegerParameter (int), CategoricalParameter (enum). They define the search space boundaries.
- Objective metric: the single number the tuner optimizes. Here: validation:auc (Maximize). AUC measures ranking quality independent of threshold.
- Early stopping: automatically halts training jobs that aren't improving — saves compute cost without sacrificing the best result.
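The sketch below shows how these pieces fit together with the SageMaker Python SDK. It is a minimal example, not the lab's exact notebook code: xgb_estimator, s3_train_uri, and s3_validation_uri stand in for the estimator and S3 inputs you configured in Lab 3.

```python
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
)

# Search space boundaries -- one entry per tuned hyperparameter (ranges from this lab).
hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 2),
    "eta": ContinuousParameter(0, 1),
    "max_depth": IntegerParameter(1, 10),
    "min_child_weight": ContinuousParameter(1, 10),
    "num_round": IntegerParameter(100, 1000),
}

# xgb_estimator is the XGBoost Estimator configured in Lab 3 (placeholder name).
tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",  # the metric the tuner optimizes
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=12,                 # total experiment budget
    max_parallel_jobs=4,         # concurrent training jobs
    strategy="Bayesian",         # the default search strategy
    early_stopping_type="Auto",  # halt jobs that cannot beat the current best
)

# Launch the tuning job with the same channels as Lab 3 (placeholder URIs).
tuner.fit({"train": s3_train_uri, "validation": s3_validation_uri})
```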
The automatic tuning workflow orchestrates multiple training jobs, evaluates their results, and uses Bayesian optimization to decide what to try next. Click each node to explore what happens at each stage.
| Dimension | Lab 3 (Single Training) | Lab 4 (Automatic Tuning) |
|---|---|---|
| Training Jobs | 1 job | 12 jobs (budget) |
| Hyperparameters | Fixed values you chose | Ranges — SageMaker explores |
| Strategy | Manual guess | Bayesian optimization |
| Parallelism | 1 instance | 4 concurrent jobs |
| Early Stopping | Not used | Auto — stops unpromising jobs |
| Output | 1 model artifact | 12 artifacts — best selected |
| Cost | ~$0.25 (5 min × 1 instance) | ~$1.50 (12 jobs, some stopped early) |
| Objective | Observed logloss | Maximized AUC (0→1 scale) |
The tuning job explores 5 hyperparameters simultaneously. Each controls a different aspect of how XGBoost builds its ensemble of decision trees. Understanding what each one does helps you interpret the correlation plots at the end.
| Parameter | Type | Range | What It Controls | Too Low | Too High |
|---|---|---|---|---|---|
| alpha | Continuous | 0 – 2 | L1 regularization (sparsity). Pushes unimportant feature weights to exactly zero. | Overfitting (all features used) | Underfitting (too many features zeroed) |
| eta | Continuous | 0 – 1 | Learning rate. Shrinks each tree's contribution to make boosting more conservative. | Needs many rounds to converge | Overshoots — unstable training |
| max_depth | Integer | 1 – 10 | Maximum tree depth. Deeper trees capture more complex interactions. | Underfitting (too simple) | Overfitting (memorizes noise) |
| min_child_weight | Continuous | 1 – 10 | Minimum sum of instance weight in a leaf. Higher = more conservative splits. | Overfitting (tiny leaf groups) | Underfitting (ignores small patterns) |
| num_round | Integer | 100 – 1000 | Number of boosting rounds (trees). More rounds = more capacity. | Underfitting (not enough trees) | Overfitting + longer training time |
These parameters don't operate in isolation. The tuner adjusts all 5 simultaneously for each experiment, which is why no single correlation chart tells the whole story.
- eta ↔ num_round: a low learning rate needs more rounds to converge; a high eta with many rounds overshoots. The tuner finds the sweet spot between them.
- max_depth ↔ alpha: deep trees with low regularization overfit; shallow trees with high alpha underfit. The balance determines model complexity.
- min_child_weight ↔ max_depth: both control tree complexity from different angles. High min_child_weight with shallow depth = a very conservative model.
- num_round ↔ early stopping: even with a high num_round, early stopping halts training when validation AUC plateaus — preventing wasted compute on overfitting rounds.
SageMaker's default tuning strategy uses Bayesian optimization — a probabilistic approach that builds a surrogate model of the objective function and uses it to decide which hyperparameter combinations to try next.
The Bayesian optimizer must balance two competing goals:
- Exploration: try configurations in unexplored regions of the search space. Might find a surprisingly good area that initial random samples missed.
- Exploitation: focus on regions near the best results so far. Refine the known-good area to squeeze out the last bit of performance.
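To make the trade-off concrete, here is a toy sketch of the general Bayesian-optimization loop: a Gaussian-process surrogate models the objective, and an expected-improvement acquisition function picks the next trial, scoring high where the predicted AUC is good (exploitation) or the uncertainty is large (exploration). This illustrates the algorithm family, not SageMaker's internal implementation; the one-dimensional "AUC vs. eta" objective and all numbers are made up, and numpy/scipy/scikit-learn are assumed to be available.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(eta):
    """Stand-in for 'train a model with this eta and return validation AUC'."""
    return 0.85 - (eta - 0.3) ** 2  # made-up curve with a peak near eta = 0.3


candidates = np.linspace(0.01, 1.0, 200).reshape(-1, 1)  # search space for eta
X = np.array([[0.05], [0.50], [0.95]])                   # initial trials
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(9):  # 3 initial + 9 more = 12 "training jobs"
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement: large where the predicted mean beats the best so far
    # (exploitation) or where the surrogate is uncertain (exploration).
    best = y.max()
    z = (mu - best) / (sigma + 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    next_x = candidates[np.argmax(ei)]  # most promising configuration to try next
    X = np.vstack([X, next_x])
    y = np.append(y, objective(next_x[0]))

print(f"best eta = {X[np.argmax(y)][0]:.3f}, best 'AUC' = {y.max():.4f}")
```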
| Strategy | How It Works | Best For | Cost |
|---|---|---|---|
| Bayesian (this lab) | Builds probabilistic model, samples promising regions | Expensive training jobs, limited budget | Low (smart sampling) |
| Random | Uniformly samples from ranges — no learning between jobs | Quick baseline, embarrassingly parallel | Medium (no intelligence) |
| Grid | Exhaustively tries all combinations in a discrete grid | Small search spaces, need reproducibility | High (exponential growth) |
| Hyperband | Starts many jobs with small budgets, promotes best performers | Large search spaces, iterative algorithms | Low (early pruning) |
After all 12 jobs complete (or are stopped early), you analyze the results: which configuration won, how AUC improved over time, and which hyperparameters correlate most strongly with performance.
AUC (Area Under the ROC Curve) measures how well the model ranks positive examples above negative ones, independent of any classification threshold.
| AUC Score | Interpretation | Analogy |
|---|---|---|
| 0.50 | Random chance — no better than flipping a coin | Guessing who will leave the company by coin flip |
| 0.60–0.70 | Weak discrimination — some signal but noisy | Predicting attrition using only department |
| 0.70–0.85 | Good discrimination — useful in production | Predicting attrition with engagement + tenure + salary |
| 0.85–0.95 | Excellent — strong ranking ability | Fraud detection with behavioral + transactional features |
| 0.95–1.00 | Suspicious — check for target leakage | Using exit_interview_scheduled to predict attrition |
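One way to internalize what the number means: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so it depends only on ranking, never on a threshold. A tiny sketch with made-up scores (scikit-learn assumed available):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up scores for 4 leavers (label 1) and 4 stayers (label 0).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.4, 0.6, 0.3, 0.2, 0.1])

# 15 of the 16 (positive, negative) pairs are ranked correctly -> AUC = 0.9375.
print(roc_auc_score(y_true, scores))

# Rescaling the scores changes nothing: the ranking (and hence the AUC) is identical.
print(roc_auc_score(y_true, scores / 10))
```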
In the tuner configuration, objective_metric_name='validation:auc' with objective_type='Maximize' tells SageMaker to prefer jobs with higher AUC values.
After tuning, the lab plots each hyperparameter against the final AUC score with a line of best fit. Here's how to read them:
- Positive slope: higher values of this parameter tend to produce better AUC. Consider expanding the upper range in a follow-up tuning job.
- Negative slope: lower values work better. The upper end of your range may be causing overfitting or instability.
- Flat line: this parameter has little effect on AUC in the tested range. You could fix it and tune other parameters instead.
- Scattered points: high variance means this parameter interacts strongly with others. Its effect depends on the combination, not its value alone.
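You can rebuild a simplified version of these plots yourself from the tuning job's results. The sketch below (placeholder job name; pandas, matplotlib, and the SageMaker SDK assumed available) pulls one row per training job and scatters each hyperparameter against the final AUC, without the fitted trend line:

```python
import matplotlib.pyplot as plt
from sagemaker.analytics import HyperparameterTuningJobAnalytics

# One row per training job: hyperparameter values plus FinalObjectiveValue (AUC).
analytics = HyperparameterTuningJobAnalytics("your-tuning-job-name")  # placeholder
df = analytics.dataframe()

params = ["alpha", "eta", "max_depth", "min_child_weight", "num_round"]
fig, axes = plt.subplots(1, len(params), figsize=(20, 3), sharey=True)
for ax, p in zip(axes, params):
    ax.scatter(df[p].astype(float), df["FinalObjectiveValue"])
    ax.set_xlabel(p)
axes[0].set_ylabel("validation:auc")
plt.show()
```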
Some training jobs in the tuning run will show status "Stopped" rather than "Completed". This is normal and expected.
With early_stopping_type='Auto', SageMaker monitors the objective metric across all running jobs. If a job's intermediate results indicate it cannot beat the current best, it's terminated early — saving compute without losing the best model. A stopped job is not a failed job.
The tuner automatically identifies the best training job via describe_hyper_parameter_tuning_job. The response includes:
| Field | What It Contains |
|---|---|
| BestTrainingJob.TrainingJobName | The job that achieved the highest validation AUC |
| BestTrainingJob.FinalHyperParameterTuningJobObjectiveMetric.Value | The winning AUC score (e.g., 0.9234) |
| BestTrainingJob.TunedHyperParameters | The exact alpha, eta, max_depth, min_child_weight, num_round values |
| TrainingJobStatusCounters | How many jobs Completed vs Stopped vs Failed |
The best model's artifact (model.tar.gz) is stored in S3 at the output path you configured. This is the model you'd deploy to an endpoint in Lab 5.
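A short sketch of pulling the winner with boto3 (the tuning job name is a placeholder; field names follow the describe_hyper_parameter_tuning_job response above):

```python
import boto3

sm = boto3.client("sagemaker")

desc = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="your-tuning-job-name"  # placeholder
)

best = desc["BestTrainingJob"]
print(best["TrainingJobName"])                                       # winning job
print(best["FinalHyperParameterTuningJobObjectiveMetric"]["Value"])  # winning AUC
print(best["TunedHyperParameters"])       # e.g. {'alpha': '0.12', 'eta': '0.31', ...}
print(desc["TrainingJobStatusCounters"])  # Completed vs Stopped vs error counts

# Locate the best model artifact (model.tar.gz) for deployment in Lab 5.
job = sm.describe_training_job(TrainingJobName=best["TrainingJobName"])
print(job["ModelArtifacts"]["S3ModelArtifacts"])
```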
How does hyperparameter tuning apply to real workforce analytics? Let's map the lab concepts to AnyCompany's ML products.
- Fraud detection: maximize recall — missing a fraudulent transaction costs $50K+. Tune aggressively for sensitivity.
- Attrition prediction: balance precision and recall — false alarms erode manager trust, but missed departures disrupt teams.
- Salary prediction: minimize RMSE — predictions need to be within $5K of market rate to be actionable for compensation teams.
- HR ticket classification: maximize macro-F1 — equal performance across all ticket categories (payroll, benefits, time-off, tax).
Mapping the lab's adult income dataset to AnyCompany's attrition prediction problem:
| Lab 4 Concept | AnyCompany Equivalent | Why It Matters |
|---|---|---|
| Binary target (income >50K) | Binary target (left_company = 1/0) | Same XGBoost binary:logistic objective |
| AUC as objective metric | AUC for ranking flight-risk employees | HR wants a ranked list, not just yes/no |
| 12 tuning jobs | 50–100 jobs (larger budget for production) | More features + more data justifies larger search |
| 5 hyperparameters tuned | Same 5 + possibly subsample, colsample_bytree | Additional regularization for high-dimensional HR data |
| Early stopping on AUC plateau | Early stopping + warm start from previous quarter's model (sketched below) | Quarterly retraining reuses prior knowledge |
| Correlation charts | Feature importance + SHAP for explainability | HR needs to explain why someone is flagged as flight-risk |
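A hedged sketch of what that warm start could look like with the SDK's WarmStartConfig. The parent tuning job name is hypothetical, and xgb_estimator / hyperparameter_ranges refer to objects like those in the earlier sketch:

```python
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

# Reuse results from last quarter's tuning job instead of starting from scratch.
warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.TRANSFER_LEARNING,
    parents={"attrition-tuning-previous-quarter"},  # hypothetical parent job name
)

tuner = HyperparameterTuner(
    estimator=xgb_estimator,                  # this quarter's estimator (placeholder)
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=50,
    max_parallel_jobs=5,
    early_stopping_type="Auto",
    warm_start_config=warm_start,
)
```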
Cost control: start with max_jobs=20, early_stopping_type='Auto', and managed Spot instances (up to 90% savings). A typical tuning job for attrition costs ~$3–5 with Spot — manageable even at scale.
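The Spot settings attach to the underlying estimator, not the tuner. A minimal sketch (region, container version, role, and bucket are placeholders, not the lab's actual values):

```python
import sagemaker
from sagemaker.estimator import Estimator

xgb_estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", "us-east-1", version="1.5-1"),
    role=role,                     # placeholder IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://your-bucket/attrition/output",  # placeholder
    use_spot_instances=True,       # managed Spot: up to ~90% cheaper than on-demand
    max_run=1800,                  # cap each training job at 30 minutes
    max_wait=3600,                 # how long to wait for Spot capacity (>= max_run)
)
```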