Module 6 - Interactive Explainer
From loss functions to gradient descent to SageMaker training jobs — understand how models learn from data and how to train them at enterprise scale for workforce solutions.
An ML model is a mathematical function that maps inputs to outputs. Training is the process of finding the best parameters (weights) that minimize prediction errors. Click any node below to explore each stage of the training loop.
At its core, an ML model is a mathematical function that maps inputs to outputs — similar to y = ax + b but with millions of parameters instead of two. Training finds the parameter values that minimize prediction errors across your dataset.
The core logic that defines how inputs combine to produce outputs. XGBoost uses decision trees, neural networks use layers of weighted connections. The algorithm stays fixed — only weights change during training.
Learned values that the algorithm adjusts during training. A simple linear model has 2 weights; a large language model has billions. Finding optimal weights IS training.
Each full pass through the training data is one epoch. The algorithm evaluates its predictions, measures error, and adjusts weights — repeating until convergence or a stopping criterion is met.
Think of training like archery practice. Each arrow (epoch) gives feedback on aim. The archer (algorithm) adjusts stance and grip (weights) based on where arrows land relative to the bullseye (loss function). After enough practice, arrows consistently hit near center — the model has converged.
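To make the loop concrete, here is a minimal sketch of that cycle in plain NumPy, fitting the y = ax + b model from above to synthetic data. The data, learning rate, and epoch count are illustrative assumptions, not values from this module.

```python
import numpy as np

# Synthetic data drawn from y = 3x + 2 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 1, 200)

a, b = 0.0, 0.0   # the two weights, starting from arbitrary values
lr = 0.02         # learning rate (step size)

for epoch in range(500):             # one epoch = one full pass over the data
    y_pred = a * x + b               # 1. predict
    error = y_pred - y               # 2. measure error against the target
    loss = np.mean(error ** 2)       #    mean squared error loss
    grad_a = 2 * np.mean(error * x)  # 3. gradient of the loss per weight
    grad_b = 2 * np.mean(error)
    a -= lr * grad_a                 # 4. adjust weights downhill
    b -= lr * grad_b

print(f"learned a={a:.2f}, b={b:.2f}, final loss={loss:.3f}")  # a ~ 3, b ~ 2
```

Each iteration of the outer loop is one epoch: predict, measure, adjust, repeat until the loss stops improving.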
Model fit reflects how well training has captured genuine patterns versus noise. The goal is a balance between complexity and simplicity — a model that generalizes to new, unseen data.
Captures real patterns, generalizes to new data. Predicts attrition for employees it has never seen before with consistent accuracy. Achieved through sufficient quality data and appropriate model complexity.
Memorizes training data including noise. Perfect on training set, terrible on new data. Like memorizing every past employee instead of learning patterns. Fix: more diverse data, regularization, early stopping.
Too simple to capture patterns. Predicts the same thing for everyone. Like using only tenure to predict attrition — misses compensation, management, and engagement signals. Fix: add features, increase model complexity.
Establishing a performance baseline requires properly partitioned data. The split ratio depends on dataset size, model complexity, and available compute — but the principle is universal: never evaluate on data used for training.
| Split | Purpose | Typical Size | HCM Example |
|---|---|---|---|
| Training Set | Model learns patterns from this data | 70-80% | 40,000 employee records with known outcomes |
| Validation Set | Tune hyperparameters, detect overfitting during training | 10-15% | 5,000 records for iterative model refinement |
| Test Set | Final unbiased performance evaluation on unseen data | 10-15% | 5,000 records never seen during training or tuning |
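A minimal sketch of that three-way split with scikit-learn, sized to match the HCM example in the table; the synthetic features and random seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 50,000 employee records with known outcomes
X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)

# Hold out 10% as the test set: never touched during training or tuning
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# Split the remaining 90% into 80% train / 10% validation of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=1/9, stratify=y_trainval, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 40000 5000 5000
```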
The loss function measures how wrong your model is. Optimization algorithms find the weights that minimize this loss. Click any node in the gradient descent flow below to explore each concept.
The loss function (also called the objective function) quantifies how far predictions are from actual values. Your choice of loss function shapes what the model optimizes for — it defines "what counts as wrong."
| Loss Function | Problem Type | What It Measures | HCM Use Case |
|---|---|---|---|
| RMSE | Regression | Standard deviation of prediction errors (penalizes large errors) | Salary prediction error in dollars |
| Cross-Entropy | Classification | How well predicted probabilities match true labels | Attrition probability calibration |
| MAE | Regression | Average absolute error (robust to outliers) | Time-to-hire prediction in days |
| Hinge Loss | Classification (SVM) | Margin between classes | Binary fraud/not-fraud separation |
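As a quick illustration, here is how three of these losses can be computed by hand with NumPy. The salary figures, labels, and probabilities are made-up examples.

```python
import numpy as np

# Regression: actual vs predicted salaries in dollars
y_true = np.array([52_000.0, 61_000.0, 75_000.0])
y_pred = np.array([50_000.0, 65_000.0, 74_000.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # penalizes large errors more
mae = np.mean(np.abs(y_true - y_pred))           # robust to outliers

# Classification: attrition labels vs predicted probabilities
labels = np.array([1.0, 0.0, 1.0])
probs = np.array([0.9, 0.2, 0.6])
cross_entropy = -np.mean(labels * np.log(probs)
                         + (1 - labels) * np.log(1 - probs))

print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  cross-entropy={cross_entropy:.3f}")
```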
The choice between optimization techniques depends on dataset size, available compute, and tolerance for noisy updates. Each variant trades off convergence smoothness against speed.
Updates weights after processing ALL training data. Smooth convergence, and each step follows the exact gradient of the full dataset. But slow — only one update per epoch. Best for small datasets where full-pass computation is feasible.
Updates weights after EACH data point. Fastest — 1,000 data points means 1,000 updates per epoch. Noisy/erratic path but can escape local minima. Best for large datasets and online learning scenarios.
Updates weights after each BATCH (32-512 samples). Hybrid approach: smoother than SGD, faster and less memory-hungry than full batch, with better generalization than full-batch training. The default choice for most production training.
With millions of payroll transactions, batch gradient descent would be impossibly slow (one update per full pass through all data). Mini-batch (batch size 256) processes data in manageable chunks, updating weights thousands of times per epoch. This is why SageMaker distributes training across multiple instances — each instance processes different mini-batches in parallel.
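A minimal NumPy sketch of that mini-batch loop, sized so the update-count arithmetic is visible: one million synthetic rows at batch size 256 gives 3,907 weight updates per epoch. The data, feature count, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 1_000_000, 256      # ~1M transactions, batches of 256
X = rng.normal(size=(n, 8))         # 8 synthetic features
y = X @ rng.normal(size=8) + rng.normal(0, 0.1, n)

w, lr = np.zeros(8), 0.05
for epoch in range(3):
    order = rng.permutation(n)      # reshuffle the data every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w - y[idx]             # predictions minus targets
        grad = 2 * X[idx].T @ error / len(idx)  # gradient on this batch only
        w -= lr * grad              # one of thousands of updates this epoch

print(f"{-(-n // batch_size)} weight updates per epoch")  # 3907
```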
Watch how loss decreases over training epochs. The curve shows the iterative refinement process — each epoch brings the model closer to optimal weights. Click "Animate" to see convergence in action.
Hyperparameters are settings you configure BEFORE training starts. Unlike model weights (learned from data), hyperparameters control HOW the model learns. They define the step size, duration, and complexity of the optimization process. Finding the ideal combination helps training progress rapidly toward an optimal solution without instability.
| Hyperparameter | Too Low | Too High | Sweet Spot |
|---|---|---|---|
| Learning Rate | Slow training, stuck in local minima | Unstable, diverges, never converges | Start at 0.01, use learning rate schedulers |
| Epochs | Underfitting, model has not learned enough | Overfitting, memorizes training noise | Early stopping when val_loss plateaus |
| Batch Size | Noisy gradients, slow per-epoch | Smooth but poor generalization, high memory | 32-256 for most problems |
| Model Depth | Cannot capture complex patterns | Overfits, slow training, diminishing returns | XGBoost max_depth=4-8, Neural nets 3-6 layers |
- Learning rate: start at 0.01, reduce by 10x if loss oscillates
- Epochs: set the max to 100, enable early stopping with patience=10
- Batch size: 256 for a 50K-employee dataset
- XGBoost-specific: max_depth=6, n_estimators=200, subsample=0.8 (see the sketch below)
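A sketch of those defaults applied to the attrition classifier with the open-source xgboost library; the synthetic data below stands in for the 50K employee records and is not part of the module.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Synthetic stand-in for labeled employee records (illustrative only)
X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    "objective": "binary:logistic",  # attrition: yes/no
    "eta": 0.01,                     # learning rate; reduce if loss oscillates
    "max_depth": 6,
    "subsample": 0.8,
    "eval_metric": "auc",
}
model = xgb.train(
    params, dtrain,
    num_boost_round=200,             # n_estimators=200
    evals=[(dval, "validation")],
    early_stopping_rounds=10,        # stop once validation AUC plateaus
)
```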
SageMaker provides fully managed infrastructure for training ML models at any scale. It containerizes ML workloads, provisions compute, and manages the full lifecycle. Three entry points serve different skill levels: Canvas (no-code), Studio (full-code), and Pipelines (automated orchestration). Click any node to explore the training flow.
SageMaker offers EC2 compute environments optimized for containerized ML workloads. The right choice depends on model complexity, dataset size, and budget constraints.
| Instance Type | Best For | Cost | AnyCompany Use Case |
|---|---|---|---|
| CPU (ml.m5, ml.c5) | Tabular ML, preprocessing, small-medium models | $ | XGBoost attrition, linear regression salary |
| GPU (ml.p3, ml.g5) | Deep learning, matrix ops, CNNs, RNNs, Transformers | $$$ | Document OCR CNN, NLP intent classification |
| Trainium (ml.trn1) | Large-scale distributed training, cost-optimized DL | $$ | Fine-tuning LLMs for AnyCompany Assist |
| Spot Instances | Fault-tolerant training with checkpointing | 60-90% off | Hyperparameter tuning jobs (can restart) |
No-code/low-code environment. Point-and-click model training for business analysts. Quick prototyping without writing code. Good for initial feasibility checks.
Full-code IDE (JupyterLab, Code Editor, RStudio). Maximum flexibility and control. Define Estimators, customize containers, use any ML framework (see the Estimator sketch below). Where ML engineers spend most of their time.
Automated workflow orchestration. Serverless infrastructure, auto-scaling. Define pipelines visually or via SDK/API. Production-grade repeatable training with governance.
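A hedged sketch of the Studio workflow with the SageMaker Python SDK: define an Estimator for the built-in XGBoost container, pick a CPU instance and Spot capacity as in the table above, and launch the job. The role ARN, bucket paths, and hyperparameter values are placeholder assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# Resolve the built-in XGBoost container image for this region
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",    # CPU suits tabular XGBoost
    output_path="s3://example-bucket/attrition/output",  # placeholder bucket
    use_spot_instances=True,         # Spot capacity for 60-90% savings
    max_run=3600,                    # seconds of training time allowed
    max_wait=7200,                   # must be >= max_run when using Spot
    hyperparameters={"objective": "binary:logistic", "num_round": 200,
                     "eta": 0.01, "max_depth": 6, "eval_metric": "auc"},
)

# Channels point at S3 prefixes holding train/validation CSVs (placeholders)
estimator.fit({
    "train": TrainingInput("s3://example-bucket/attrition/train/",
                           content_type="text/csv"),
    "validation": TrainingInput("s3://example-bucket/attrition/validation/",
                                content_type="text/csv"),
})
```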
Production ML is not a one-shot notebook. Click any node below to explore the automated ML pipeline from data ingestion to production deployment.
A centralized repository for managing ML models throughout their lifecycle. Every model version is tracked with full lineage — which data trained it, which hyperparameters were used, who approved it, and when it was deployed.
Each training run creates a new version (V1, V2, V3). Model groups contain all versions trained for a specific problem. Collections organize multiple groups hierarchically.
Models transition through approval statuses: PendingManualApproval → Approved (or Rejected), with deployment following approval. At AnyCompany: data science trains, ML ops reviews metrics, compliance approves for production. Only validated models reach users.
CI/CD pipelines auto-deploy approved models. Rollback to previous version if issues detected. Zero-downtime updates ensure production systems always run the latest approved version.
Full lineage: which data trained which model, who approved it, when deployed. Model metadata includes baseline metrics and data hash. Critical for AnyCompany regulatory compliance across 140+ jurisdictions.
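A hedged sketch of registering a trained model version and then approving it via the SageMaker Python SDK and boto3; the group name, image URI, artifact path, and role ARN are placeholder assumptions.

```python
import boto3
from sagemaker.model import Model

# Placeholders: a real flow would take these from a completed training job
model = Model(
    image_uri="<xgboost-container-image-uri>",
    model_data="s3://example-bucket/attrition/output/model.tar.gz",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
)

# Each register() call creates a new version (V1, V2, ...) in the group
package = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="attrition-models",
    approval_status="PendingManualApproval",   # waits for human review
)

# After ML ops and compliance review, flip the status to Approved
boto3.client("sagemaker").update_model_package(
    ModelPackageArn=package.model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Metrics reviewed; cleared for production",
)
```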
SageMaker AMT uses machine learning itself to find optimal hyperparameters. Instead of manually testing combinations, AMT intelligently explores the hyperparameter space by learning from previous training job results.
| Feature | What It Does | Benefit |
|---|---|---|
| Bayesian Optimization | Builds a probabilistic model of the objective function, intelligently choosing next configurations | Finds good values faster than random search — learns from every trial |
| Parallel Jobs | Runs multiple training jobs simultaneously across different configurations | Reduces wall-clock time for tuning from days to hours |
| Early Stopping | Kills unpromising jobs before completion based on intermediate metrics | Saves 40-60% compute cost on bad configurations |
| Warm Start | Builds on results from previous tuning runs, reusing learned knowledge | Incremental improvement without starting over — refine as data changes |
| Hyperband | Allocates resources dynamically — more compute to promising configs | Efficient exploration of large hyperparameter spaces |
1. Define hyperparameter ranges (e.g., eta: [0.01, 0.3], max_depth: [3, 10])
2. Choose the objective metric (validation:auc for the attrition model)
3. Set max training jobs (e.g., 50) and max parallel jobs (e.g., 5)
4. Enable early stopping to save resources on poor configurations

Result: AMT returns the best hyperparameter combination and the trained model artifact (the sketch below shows these steps in code).
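Those steps map onto the SageMaker Python SDK's HyperparameterTuner roughly as follows; `estimator`, `train_input`, and `val_input` are assumed to be configured as in the earlier Studio sketch.

```python
from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                             IntegerParameter)

tuner = HyperparameterTuner(
    estimator=estimator,                     # Estimator from the Studio sketch
    objective_metric_name="validation:auc",  # step 2
    objective_type="Maximize",
    strategy="Bayesian",                     # learns from every completed trial
    hyperparameter_ranges={                  # step 1
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=50,                             # step 3
    max_parallel_jobs=5,
    early_stopping_type="Auto",              # step 4
)

tuner.fit({"train": train_input, "validation": val_input})
print(tuner.best_training_job())             # job holding the best combination
```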
Walk through a complete SageMaker training job for an AnyCompany use case. This mirrors Lab 3 tasks: configure an estimator, set hyperparameters, launch a training job, monitor metrics in CloudWatch, and evaluate the resulting model artifact. Select a scenario below, then click nodes or auto-play to step through.
After this module, you should be able to define core elements of the model training process and describe how SageMaker AI features support training at enterprise scale.
Models learn by minimizing loss functions through gradient descent. Weights are iteratively adjusted across epochs until convergence. Avoid overfitting with proper train/val/test splits and early stopping.
Managed training with CPU, GPU, or Trainium instances. Three entry points (Canvas, Studio, Pipelines). Containerized environments with built-in algorithms. Pay-per-second with Spot savings.
Automate training workflows with serverless orchestration. Version models with full lineage. Approval gates for production. CI/CD deployment automation. Full audit trail for compliance.
AMT uses Bayesian optimization to find optimal configurations. Parallel jobs reduce wall-clock time. Early stopping saves cost. Warm start enables incremental refinement as data evolves.