Module 6 - Interactive Explainer
From loss functions to gradient descent to SageMaker training jobs — understand how models learn from data and how to train them at enterprise scale for workforce solutions.
An ML model is a mathematical function that maps inputs to outputs. Training is the process of finding the best parameters (weights) that minimize prediction errors. Click any node below to explore each stage of the training loop.
At its core, an ML model is a mathematical function that maps inputs to outputs — similar to y = ax + b but with millions of parameters instead of two. Training finds the parameter values that minimize prediction errors across your dataset.
The core logic that defines how inputs combine to produce outputs. XGBoost uses decision trees, neural networks use layers of weighted connections. The algorithm stays fixed — only weights change during training.
Learned values that the algorithm adjusts during training. A simple linear model has 2 weights; a large language model has billions. Finding optimal weights IS training.
Each full pass through the training data is one epoch. The algorithm evaluates its predictions, measures error, and adjusts weights — repeating until convergence or a stopping criterion is met.
Think of training like archery practice. Each arrow (epoch) gives feedback on aim. The archer (algorithm) adjusts stance and grip (weights) based on where arrows land relative to the bullseye (loss function). After enough practice, arrows consistently hit near center — the model has converged.
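To make the loop concrete, here is a minimal sketch of that cycle in plain NumPy, fitting the y = ax + b model from above to synthetic data. The data, learning rate, and epoch count are illustrative assumptions, not values from this module.

```python
import numpy as np

# Synthetic data drawn from y = 3x + 2 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 1, 200)

a, b = 0.0, 0.0   # the two weights, starting from arbitrary values
lr = 0.02         # learning rate (step size)

for epoch in range(500):             # one epoch = one full pass over the data
    y_pred = a * x + b               # 1. predict
    error = y_pred - y               # 2. measure error against the target
    loss = np.mean(error ** 2)       #    mean squared error loss
    grad_a = 2 * np.mean(error * x)  # 3. gradient of the loss per weight
    grad_b = 2 * np.mean(error)
    a -= lr * grad_a                 # 4. adjust weights downhill
    b -= lr * grad_b

print(f"learned a={a:.2f}, b={b:.2f}, final loss={loss:.3f}")  # a ~ 3, b ~ 2
```

Each iteration of the outer loop is one epoch: predict, measure, adjust, repeat until the loss stops improving.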
Model fit reflects how well training has captured genuine patterns versus noise. The goal is a balance between complexity and simplicity — a model that generalizes to new, unseen data.
Captures real patterns, generalizes to new data. Predicts attrition for employees it has never seen before with consistent accuracy. Achieved through sufficient quality data and appropriate model complexity.
Memorizes training data including noise. Perfect on training set, terrible on new data. Like memorizing every past employee instead of learning patterns. Fix: more diverse data, regularization, early stopping.
Too simple to capture patterns. Predicts the same thing for everyone. Like using only tenure to predict attrition — misses compensation, management, and engagement signals. Fix: add features, increase model complexity.
Establishing a performance baseline requires properly partitioned data. The split ratio depends on dataset size, model complexity, and available compute — but the principle is universal: never evaluate on data used for training.
| Split | Purpose | Typical Size | HCM Example |
|---|---|---|---|
| Training Set | Model learns patterns from this data | 70-80% | 40,000 employee records with known outcomes |
| Validation Set | Tune hyperparameters, detect overfitting during training | 10-15% | 5,000 records for iterative model refinement |
| Test Set | Final unbiased performance evaluation on unseen data | 10-15% | 5,000 records never seen during training or tuning |
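A minimal sketch of that three-way split with scikit-learn, sized to match the HCM example in the table; the synthetic features and random seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 50,000 employee records with known outcomes
X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)

# Hold out 10% as the test set: never touched during training or tuning
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# Split the remaining 90% into 80% train / 10% validation of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=1/9, stratify=y_trainval, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 40000 5000 5000
```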
The loss function measures how wrong your model is. Optimization algorithms find the weights that minimize this loss. Click any node in the gradient descent flow below to explore each concept.
The loss function (also called the objective function) quantifies how far predictions are from actual values. Your choice of loss function shapes what the model optimizes for — it defines "what counts as wrong."
| Loss Function | Problem Type | What It Measures | HCM Use Case |
|---|---|---|---|
| RMSE | Regression | Standard deviation of prediction errors (penalizes large errors) | Salary prediction error in dollars |
| Cross-Entropy | Classification | How well predicted probabilities match true labels | Attrition probability calibration |
| MAE | Regression | Average absolute error (robust to outliers) | Time-to-hire prediction in days |
| Hinge Loss | Classification (SVM) | Margin between classes | Binary fraud/not-fraud separation |
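As a quick illustration, here is how three of these losses can be computed by hand with NumPy. The salary figures, labels, and probabilities are made-up examples.

```python
import numpy as np

# Regression: actual vs predicted salaries in dollars
y_true = np.array([52_000.0, 61_000.0, 75_000.0])
y_pred = np.array([50_000.0, 65_000.0, 74_000.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # penalizes large errors more
mae = np.mean(np.abs(y_true - y_pred))           # robust to outliers

# Classification: attrition labels vs predicted probabilities
labels = np.array([1.0, 0.0, 1.0])
probs = np.array([0.9, 0.2, 0.6])
cross_entropy = -np.mean(labels * np.log(probs)
                         + (1 - labels) * np.log(1 - probs))

print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  cross-entropy={cross_entropy:.3f}")
```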
The choice between optimization techniques depends on dataset size, available compute, and tolerance for noisy updates. Each variant trades off convergence smoothness against speed.
Updates weights after processing ALL training data. Smooth convergence, and each step follows the exact gradient of the full dataset. But slow — only one update per epoch. Best for small datasets where full-pass computation is feasible.
Updates weights after EACH data point. Fastest — 1,000 data points means 1,000 updates per epoch. Noisy/erratic path but can escape local minima. Best for large datasets and online learning scenarios.
Updates weights after each BATCH (32-512 samples). Hybrid approach: smoother than SGD, faster and less memory-hungry than full batch, with better generalization than full-batch training. The default choice for most production training.
With millions of payroll transactions, batch gradient descent would be impossibly slow (one update per full pass through all data). Mini-batch (batch size 256) processes data in manageable chunks, updating weights thousands of times per epoch. This is why SageMaker distributes training across multiple instances — each instance processes different mini-batches in parallel.
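A minimal NumPy sketch of that mini-batch loop, sized so the update-count arithmetic is visible: one million synthetic rows at batch size 256 gives 3,907 weight updates per epoch. The data, feature count, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 1_000_000, 256      # ~1M transactions, batches of 256
X = rng.normal(size=(n, 8))         # 8 synthetic features
y = X @ rng.normal(size=8) + rng.normal(0, 0.1, n)

w, lr = np.zeros(8), 0.05
for epoch in range(3):
    order = rng.permutation(n)      # reshuffle the data every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w - y[idx]             # predictions minus targets
        grad = 2 * X[idx].T @ error / len(idx)  # gradient on this batch only
        w -= lr * grad              # one of thousands of updates this epoch

print(f"{-(-n // batch_size)} weight updates per epoch")  # 3907
```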
Watch how loss decreases over training epochs. The curve shows the iterative refinement process — each epoch brings the model closer to optimal weights. Click "Animate" to see convergence in action.
Hyperparameters are settings you configure BEFORE training starts. Unlike model weights (learned from data), hyperparameters control HOW the model learns. They define the step size, duration, and complexity of the optimization process. Finding the ideal combination helps training progress rapidly toward an optimal solution without instability.
| Hyperparameter | Too Low | Too High | Sweet Spot |
|---|---|---|---|
| Learning Rate | Slow training, stuck in local minima | Unstable, diverges, never converges | Start at 0.01, use learning rate schedulers |
| Epochs | Underfitting, model has not learned enough | Overfitting, memorizes training noise | Early stopping when val_loss plateaus |
| Batch Size | Noisy gradients, slow per-epoch | Smooth but poor generalization, high memory | 32-256 for most problems |
| Model Depth | Cannot capture complex patterns | Overfits, slow training, diminishing returns | XGBoost max_depth=4-8, Neural nets 3-6 layers |
- Learning rate: start at 0.01, reduce by 10x if loss oscillates
- Epochs: set the max to 100, enable early stopping with patience=10
- Batch size: 256 for a 50K-employee dataset
- XGBoost-specific: max_depth=6, n_estimators=200, subsample=0.8 (see the sketch below)
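A sketch of those defaults applied to the attrition classifier with the open-source xgboost library; the synthetic data below stands in for the 50K employee records and is not part of the module.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Synthetic stand-in for labeled employee records (illustrative only)
X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    "objective": "binary:logistic",  # attrition: yes/no
    "eta": 0.01,                     # learning rate; reduce if loss oscillates
    "max_depth": 6,
    "subsample": 0.8,
    "eval_metric": "auc",
}
model = xgb.train(
    params, dtrain,
    num_boost_round=200,             # n_estimators=200
    evals=[(dval, "validation")],
    early_stopping_rounds=10,        # stop once validation AUC plateaus
)
```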
SageMaker provides fully managed infrastructure for training ML models at any scale. It containerizes ML workloads, provisions compute, and manages the full lifecycle. Three entry points serve different skill levels: Canvas (no-code), Studio (full-code), and Pipelines (automated orchestration). Click any node to explore the training flow.
SageMaker offers EC2 compute environments optimized for containerized ML workloads. The right choice depends on model complexity, dataset size, and budget constraints.
| Instance Type | Best For | Cost | AnyCompany Use Case |
|---|---|---|---|
| CPU (ml.m5, ml.c5) | Tabular ML, preprocessing, small-medium models | $ | XGBoost attrition, linear regression salary |
| GPU (ml.p3, ml.g5) | Deep learning, matrix ops, CNNs, RNNs, Transformers | $$$ | Document OCR CNN, NLP intent classification |
| Trainium (ml.trn1) | Large-scale distributed training, cost-optimized DL | $$ | Fine-tuning LLMs for AnyCompany Assist |
| Spot Instances | Fault-tolerant training with checkpointing | 60-90% off | Hyperparameter tuning jobs (can restart) |
No-code/low-code environment. Point-and-click model training for business analysts. Quick prototyping without writing code. Good for initial feasibility checks.
Full-code IDE (JupyterLab, Code Editor, RStudio). Maximum flexibility and control. Define Estimators, customize containers, use any ML framework (see the Estimator sketch below). Where ML engineers spend most of their time.
Automated workflow orchestration. Serverless infrastructure, auto-scaling. Define pipelines visually or via SDK/API. Production-grade repeatable training with governance.
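A hedged sketch of the Studio workflow with the SageMaker Python SDK: define an Estimator for the built-in XGBoost container, pick a CPU instance and Spot capacity as in the table above, and launch the job. The role ARN, bucket paths, and hyperparameter values are placeholder assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# Resolve the built-in XGBoost container image for this region
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",    # CPU suits tabular XGBoost
    output_path="s3://example-bucket/attrition/output",  # placeholder bucket
    use_spot_instances=True,         # Spot capacity for 60-90% savings
    max_run=3600,                    # seconds of training time allowed
    max_wait=7200,                   # must be >= max_run when using Spot
    hyperparameters={"objective": "binary:logistic", "num_round": 200,
                     "eta": 0.01, "max_depth": 6, "eval_metric": "auc"},
)

# Channels point at S3 prefixes holding train/validation CSVs (placeholders)
estimator.fit({
    "train": TrainingInput("s3://example-bucket/attrition/train/",
                           content_type="text/csv"),
    "validation": TrainingInput("s3://example-bucket/attrition/validation/",
                                content_type="text/csv"),
})
```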
Production ML is not a one-shot notebook. Click any node below to explore the automated ML pipeline from data ingestion to production deployment.
A centralized repository for managing ML models throughout their lifecycle. Every model version is tracked with full lineage — which data trained it, which hyperparameters were used, who approved it, and when it was deployed.
Each training run creates a new version (V1, V2, V3). Model groups contain all versions trained for a specific problem. Collections organize multiple groups hierarchically.
Models transition through approval statuses: PendingManualApproval → Approved (or Rejected), with deployment following approval. At AnyCompany: data science trains, ML ops reviews metrics, compliance approves for production. Only validated models reach users.
CI/CD pipelines auto-deploy approved models. Rollback to previous version if issues detected. Zero-downtime updates ensure production systems always run the latest approved version.
Full lineage: which data trained which model, who approved it, when deployed. Model metadata includes baseline metrics and data hash. Critical for AnyCompany regulatory compliance across 140+ jurisdictions.
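A hedged sketch of registering a trained model version and then approving it via the SageMaker Python SDK and boto3; the group name, image URI, artifact path, and role ARN are placeholder assumptions.

```python
import boto3
from sagemaker.model import Model

# Placeholders: a real flow would take these from a completed training job
model = Model(
    image_uri="<xgboost-container-image-uri>",
    model_data="s3://example-bucket/attrition/output/model.tar.gz",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
)

# Each register() call creates a new version (V1, V2, ...) in the group
package = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="attrition-models",
    approval_status="PendingManualApproval",   # waits for human review
)

# After ML ops and compliance review, flip the status to Approved
boto3.client("sagemaker").update_model_package(
    ModelPackageArn=package.model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Metrics reviewed; cleared for production",
)
```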
SageMaker AMT uses machine learning itself to find optimal hyperparameters. Instead of manually testing combinations, AMT intelligently explores the hyperparameter space by learning from previous training job results.
| Feature | What It Does | Benefit |
|---|---|---|
| Bayesian Optimization | Builds a probabilistic model of the objective function, intelligently choosing next configurations | Finds good values faster than random search — learns from every trial |
| Parallel Jobs | Runs multiple training jobs simultaneously across different configurations | Reduces wall-clock time for tuning from days to hours |
| Early Stopping | Kills unpromising jobs before completion based on intermediate metrics | Saves 40-60% compute cost on bad configurations |
| Warm Start | Builds on results from previous tuning runs, reusing learned knowledge | Incremental improvement without starting over — refine as data changes |
| Hyperband | Allocates resources dynamically — more compute to promising configs | Efficient exploration of large hyperparameter spaces |
1. Define hyperparameter ranges (e.g., eta: [0.01, 0.3], max_depth: [3, 10])
2. Choose the objective metric (validation:auc for the attrition model)
3. Set max training jobs (e.g., 50) and max parallel jobs (e.g., 5)
4. Enable early stopping to save resources on poor configurations

Result: AMT returns the best hyperparameter combination and the trained model artifact (the sketch below shows these steps in code).
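Those steps map onto the SageMaker Python SDK's HyperparameterTuner roughly as follows; `estimator`, `train_input`, and `val_input` are assumed to be configured as in the earlier Studio sketch.

```python
from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                             IntegerParameter)

tuner = HyperparameterTuner(
    estimator=estimator,                     # Estimator from the Studio sketch
    objective_metric_name="validation:auc",  # step 2
    objective_type="Maximize",
    strategy="Bayesian",                     # learns from every completed trial
    hyperparameter_ranges={                  # step 1
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=50,                             # step 3
    max_parallel_jobs=5,
    early_stopping_type="Auto",              # step 4
)

tuner.fit({"train": train_input, "validation": val_input})
print(tuner.best_training_job())             # job holding the best combination
```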
Walk through a complete SageMaker training job for an AnyCompany use case. This mirrors Lab 3 tasks: configure an estimator, set hyperparameters, launch a training job, monitor metrics in CloudWatch, and evaluate the resulting model artifact. Select a scenario below, then click nodes or auto-play to step through.
After this module, you should be able to define core elements of the model training process and describe how SageMaker AI features support training at enterprise scale.
Models learn by minimizing loss functions through gradient descent. Weights are iteratively adjusted across epochs until convergence. Avoid overfitting with proper train/val/test splits and early stopping.
Managed training with CPU, GPU, or Trainium instances. Three entry points (Canvas, Studio, Pipelines). Containerized environments with built-in algorithms. Pay-per-second with Spot savings.
Automate training workflows with serverless orchestration. Version models with full lineage. Approval gates for production. CI/CD deployment automation. Full audit trail for compliance.
AMT uses Bayesian optimization to find optimal configurations. Parallel jobs reduce wall-clock time. Early stopping saves cost. Warm start enables incremental refinement as data evolves.