Module 5 - Interactive Explainer

Choosing a Modeling Approach

Navigate SageMaker built-in algorithms, AutoML with Autopilot, model selection trade-offs, and cost optimization strategies for production ML.

🤖 Algorithm Selection ⚡ Interactive 🏢 HCM Context

🧠 SageMaker AI Built-in Algorithms

SageMaker provides a comprehensive suite of preconfigured, optimized algorithms that cover a wide range of ML tasks. These handle distributed training, data loading, and infrastructure automatically — letting you build and deploy models without extensive customization. The model development workflow follows: prepared data → algorithm selection → training → evaluation → output model.

💡
Note: Beyond built-in algorithms, you can also bring custom algorithms or use pre-trained models from AWS Marketplace. This flexibility lets you choose the approach that best aligns with your use case, existing codebase, and team expertise.

Algorithm Family Explorer

Click any family to see its algorithms and AnyCompany use cases:

📋 Tree-Based Models: XGBoost, CatBoost. Best for tabular data — handles missing values, feature interactions, non-linear relationships. AnyCompany: fraud detection, income prediction, attrition classification. Start here for any structured data problem.
SageMaker AI built-in algorithm families:
📈 Linear Models (regression, classification)
🌳 Tree-Based (XGBoost, CatBoost)
🧬 Neural Networks (deep learning)
🎯 Clustering (K-Means, RCF)
📊 Forecasting (DeepAR, Prophet)
📝 Text & Vision (BlazingText, CNN)

🌳 What Is a Decision Tree?

A decision tree is a flowchart-like structure where each internal node asks a yes/no question about a feature, each branch represents the answer, and each leaf node gives a prediction. Click any node in the tree below to see how it makes decisions:

🌳 Root Node — "Is tenure < 2 years?" The tree starts here. It picks the single question that best separates leavers from stayers. Short tenure is the strongest signal of attrition risk, so it splits on this first.
Tree structure (depth 2):
- Depth 0 (root): Tenure < 2 yrs? (root split, most important)
- Depth 1, YES branch: Engagement < 3? (low-engagement check for short-tenure employees)
- Depth 1, NO branch: Salary < 30th %ile? (below-market-pay check for longer-tenure employees)
- Depth 2 leaves: 🚨 LEAVE (P = 0.87), ⚠ RISK (P = 0.55), ⚠ RISK (P = 0.42), ✅ STAY (P = 0.12)
- Legend: high risk (P > 0.7), moderate risk, low risk (P < 0.2)
🌳

Single Decision Tree

Simple, interpretable, but fragile. One tree with max_depth=2 (shown above) can only ask 2 questions. Easy to understand but misses complex patterns. Prone to overfitting on training data.
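The depth-2 tree above can be mirrored directly in code: each internal node is one yes/no question, each leaf returns a probability. This is an illustrative sketch (the function name, argument names, and the exact thresholds/probabilities are taken from the diagram, not from a trained model):

```python
def attrition_risk(tenure_years: float, engagement: float, salary_pctile: float) -> float:
    """Mirror of the depth-2 decision tree above: each branch is one
    yes/no question, each leaf returns P(leave)."""
    if tenure_years < 2:               # root split (most important)
        if engagement < 3:             # low-engagement check
            return 0.87                # 🚨 LEAVE
        return 0.55                    # ⚠ RISK
    if salary_pctile < 30:             # below-market-pay check
        return 0.42                    # ⚠ RISK
    return 0.12                        # ✅ STAY

# New hire with low engagement falls into the highest-risk leaf
print(attrition_risk(tenure_years=1.0, engagement=2.0, salary_pctile=60))  # 0.87
```

Note how a prediction is just a walk from root to leaf; that traceability is exactly what makes a single tree so interpretable.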

🌲🌲🌲

Random Forest (100 trees)

Build 100 trees on random subsets of data. Each tree votes on the prediction. Majority wins. More robust than a single tree — reduces variance. But all trees are built independently (no learning from mistakes).
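The voting step is simple enough to sketch in a few lines. Here the individual tree predictions are hypothetical stand-ins (a real forest would produce them by running each tree on the input), and `forest_vote` is an illustrative name:

```python
from collections import Counter

def forest_vote(tree_predictions: list[str]) -> str:
    """Random forest prediction: every tree votes independently and the
    majority class wins. Averaging many independent trees reduces variance
    compared with trusting a single tree."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from 100 trees, each trained on a random subset of the data
votes = ["leave"] * 38 + ["stay"] * 62
print(forest_vote(votes))  # stay
```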

🌳→🌳→🌳

XGBoost (1000 sequential trees)

Build trees one after another. Each new tree focuses on the errors the previous trees got wrong. Tree #500 specializes in the hard cases that trees #1–499 couldn't solve. This is why XGBoost beats Random Forest on most tabular problems.

XGBoost: Sequential Tree Building

XGBoost doesn't just build one tree — it builds hundreds or thousands in sequence. Each tree is small and weak on its own, but together they form a powerful ensemble. Click any tree to see what it learns:

🌳 Tree 1 — Learn the big patterns: The first tree captures the most obvious signal (tenure < 2 years → likely to leave). It gets ~65% accuracy. The remaining 35% of errors become the training data for Tree 2.
Boosting sequence: 🌳 Tree 1 (big patterns, acc 65%) → errors → 🌳 Tree 2 (fix Tree 1 errors, acc 74%) → errors → 🌳 Tree 3 (fix remaining, acc 80%) → errors → Trees 4–999 (incremental fixes, acc 85% → 89%) → 🎯 Ensemble (sum of all trees, acc 89%).

What each tree specializes in:
- Tree 1 learns: tenure < 2 → leave; high salary → stay (obvious patterns)
- Tree 2 learns: low engagement → leave, even with long tenure (cases Tree 1 missed)
- Tree 3 learns: 3+ manager changes → leave (instability, a subtler signal)
- Trees 4–999: edge cases and combinations (remote + no promo + ...), diminishing returns

Final prediction: sum all tree outputs (weighted by eta=0.1) → sigmoid → P(leave).
💡
Key insight: Each tree is intentionally weak (shallow, max_depth=5). A single tree might only be 60% accurate. But 1000 weak trees, each correcting different errors, combine into an 89%+ accurate ensemble. This is the "boosting" in XGBoost: boosting weak learners into a strong learner through sequential error correction.
💡
Lab 3 connection: In Lab 3, you set max_depth=5 and num_round=1000. This means XGBoost builds 1000 trees, each up to 5 levels deep (can ask 5 sequential questions). The tree above shows depth=2 for simplicity — imagine 3 more levels of splits below each leaf, and 999 more trees each correcting different errors.
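The "sum → sigmoid" step can be sketched in plain Python. This is a simplified illustration, not XGBoost's actual implementation (in real XGBoost the learning rate is applied during training); the function name and the raw tree scores below are made up for the example:

```python
import math

def ensemble_probability(tree_outputs: list[float], eta: float = 0.1) -> float:
    """XGBoost-style prediction: sum the learning-rate-weighted raw outputs
    of all trees, then squash the total through a sigmoid to get P(leave)."""
    raw_score = sum(eta * out for out in tree_outputs)
    return 1 / (1 + math.exp(-raw_score))

# Toy raw scores from the first few trees: each later tree adds a smaller
# correction on top of the earlier ones (illustrative values only).
scores = [2.0, 1.1, 0.8, 0.5, 0.3]
p_leave = ensemble_probability(scores)
print(round(p_leave, 3))
```

Because each tree only nudges the summed score by `eta` times its output, no single tree dominates; the prediction emerges from many small corrections.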

🔧 Four Implementation Options

| Option | Description | Best For | Effort |
| --- | --- | --- | --- |
| Built-in Algorithms | Pre-built, optimized algorithms in SageMaker | Standard ML problems with tabular/text/image data | Low |
| Script Mode | Your code running on SageMaker managed infrastructure | Custom logic with familiar frameworks (PyTorch, TF, sklearn) | Medium |
| Bring Your Own Container | Custom Docker container with full control | Proprietary algorithms, special dependencies | High |
| AWS Marketplace | Third-party pre-trained models | Specialized domains, pre-trained solutions | Low |

🚀 Amazon SageMaker Autopilot

Autopilot is AutoML simplified — deeply integrated throughout SageMaker AI systems. It automatically explores multiple algorithms, tunes hyperparameters, and produces a leaderboard of models. Unlike black-box solutions, Autopilot provides complete transparency by generating notebooks that document the entire ML pipeline, so you can understand and customize every step.

Autopilot Workflow

Click any step to explore, or auto-play to walk through the full process:

📋 Prepare Dataset: Upload clean tabular data (CSV) to S3. Ensure you have a clear target column. Autopilot handles the rest — algorithm selection, feature engineering, hyperparameter tuning.
Workflow: 📥 Prepare (upload CSV to S3) → Configure (target + metric) → 🔬 Explore (algorithms + tuning) → 🏆 Leaderboard (rank models) → 🚀 Deploy (one-click)
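Programmatically, the same workflow maps onto a single SageMaker API call. The sketch below shows a minimal `create_auto_ml_job` request via boto3; the bucket, job name, role ARN, and target column are placeholders, not values from this module:

```python
# Illustrative Autopilot job request (placeholders throughout).
autopilot_job = {
    "AutoMLJobName": "attrition-autopilot-demo",
    "ProblemType": "BinaryClassification",            # will employee leave?
    "AutoMLJobObjective": {"MetricName": "F1"},       # metric to optimize
    "InputDataConfig": [{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/hr/train.csv",
        }},
        "TargetAttributeName": "left_company",        # the label column
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/hr/output/"},
    "RoleArn": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
}
# Requires AWS credentials to actually run:
# boto3.client("sagemaker").create_auto_ml_job(**autopilot_job)
```

Everything Autopilot then does (algorithm exploration, tuning, leaderboard) is driven from this one declarative request.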

When to Use Autopilot

Great For

Quick baselines, new use cases where you are unsure which algorithm works best, teams without deep ML expertise, tabular classification and regression. Democratizes ML across departments that previously lacked technical resources.

Not Ideal For

Unstructured data (images, audio), real-time streaming, custom architectures, or when you need full control over training logic. Also not suited for very large datasets where manual optimization is more cost-effective.

💡
Access Autopilot from: SageMaker Canvas (no-code, business analysts), SageMaker Studio (notebook-based, data scientists), SageMaker Pipelines (automated CI/CD workflows), or the Python SDK (programmatic integration). The same AutoML engine powers all entry points — choose based on your team's technical level.

🎯 Selecting Built-in Algorithms

Selecting the right algorithm directly impacts model performance, accuracy, and suitability for your use case. SageMaker organizes built-in algorithms by learning type and data modality. Key factors to consider: problem type, data characteristics (size, dimensionality, noise), performance requirements, training time, interpretability needs, model complexity, scalability, and domain knowledge.

Supervised Learning Algorithms

| Problem Type | Algorithm | Best For | HCM Example |
| --- | --- | --- | --- |
| Binary Classification | XGBoost, Linear Learner, CatBoost | Yes/No predictions from tabular data | Will employee leave? Is transaction fraud? |
| Multi-class Classification | XGBoost, k-NN, Linear Learner | Categorize into 3+ classes | Route ticket to IT/HR/Payroll/Benefits |
| Regression | XGBoost, k-NN, Linear Learner | Predict continuous numbers | Predict salary, time-to-hire, demand |
| Time-Series Forecasting | DeepAR | Predict future values in sequences | Workforce demand, seasonal payroll volume |
| Recommendation | Factorization Machines | User-item interaction predictions | Learning path recommendations |

Unsupervised Learning Algorithms

| Problem Type | Algorithm | Best For | HCM Example |
| --- | --- | --- | --- |
| Anomaly Detection | Random Cut Forest, IP Insights | Find unusual patterns without labels | Payroll fraud, login anomalies |
| Clustering | K-Means | Group similar data points | Employee segments, job role taxonomy |
| Topic Modeling | LDA, NTM | Discover themes in text | Support ticket topic extraction |
| Dimensionality Reduction | PCA | Reduce feature count | Compress 100+ HR features to top 10 |
🎯
XGBoost is your Swiss Army knife. For tabular data (which is most HCM data), XGBoost handles classification, regression, and ranking. It is fast, accurate, handles missing values, and works at scale. Start here for any structured data problem.

🔍 Interpretability vs Performance

At AnyCompany, ML models affect compensation, hiring, and career outcomes. Interpretability is particularly important in domains where understanding the model's decision-making process is essential — healthcare, finance, and HR. Some models like decision trees offer high interpretability, while neural networks can be more opaque. The best model often involves trade-offs between accuracy and explainability.

Interpretability Spectrum

Click any model type to see its trade-offs:

📋 XGBoost (Moderate Interpretability): Best tabular accuracy + SHAP explanations. Feature importance scores rank which inputs drove each prediction. At AnyCompany: "This employee was flagged because tenure is low AND salary is below market." Good balance of accuracy and explainability.
Spectrum (high interpretability → high accuracy): 📏 Linear (very high interpretability) → 🌳 Decision Tree (high) → 🌲 Random Forest (moderate) → 📊 XGBoost (moderate, + SHAP) → 🧬 Neural Net (low interpretability)
Regulatory requirements: EU AI Act, NYC Local Law 144 (automated hiring), and EEOC guidelines require explainability for models that affect employment decisions. At AnyCompany, any model influencing hiring, compensation, or termination MUST be interpretable.
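To see why linear models sit at the "very high interpretability" end of the spectrum, note that each feature's contribution to the score is just weight times value, so the explanation falls out of the model itself. The weights and employee values below are hypothetical, purely for illustration:

```python
def linear_contributions(weights: dict[str, float],
                         features: dict[str, float]) -> dict[str, float]:
    """For a linear model, each feature's contribution to the prediction
    score is simply weight * value; summing them gives the raw score."""
    return {name: weights[name] * features[name] for name in weights}

# Hypothetical attrition model (negative score = more likely to stay)
weights = {"tenure_years": -0.4, "engagement": -0.3, "salary_pctile": -0.01}
employee = {"tenure_years": 1.0, "engagement": 2.0, "salary_pctile": 25.0}
for name, contrib in linear_contributions(weights, employee).items():
    print(f"{name}: {contrib:+.2f}")
```

SHAP values generalize this per-feature-contribution idea to non-linear models such as XGBoost, which is how those models earn their "moderate" interpretability rating.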

💰 ML Cost Considerations

ML is compute-intensive. Different models have varying computational requirements during training — understanding cost drivers helps you choose the right model complexity, hardware, and scaling strategy. Data characteristics (size, dimensionality) directly influence which models are feasible and what resources they need. Balance performance goals with available budget.

Cost Factor Explorer

Click any factor to see optimization strategies:

📋 CPU vs GPU: CPUs are cheaper and sufficient for tabular ML (XGBoost, linear models). GPUs are needed for deep learning (neural networks, transformers) but cost 3-10x more. At AnyCompany: fraud detection (XGBoost) runs on CPU at ~$5/run. Document OCR (CNN) needs GPU at ~$50/run.
Cost optimization strategies (reduce spend, maintain quality): 🖥 CPU vs GPU (hardware choice); Spot Instances (60–90% savings); 🗜 Compression (quantize, prune); 🔄 Transfer Learning (start from a pre-trained model)
💰 Cost Comparison
Strategy: CPU vs GPU selection
CPU cost: ~$0.50/hr (ml.m5.xlarge)
GPU cost: ~$3.80/hr (ml.g4dn.xlarge)
Rule: tabular = CPU; deep learning = GPU
AnyCompany: fraud detection (CPU, ~$5/run) vs document OCR (GPU, ~$50/run)
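The cost arithmetic is simple enough to sketch. The hourly rates come from the comparison above; the run durations and the 70% spot discount (mid-range of the 60–90% savings cited for spot capacity) are illustrative assumptions:

```python
def training_cost(hours: float, hourly_rate: float, spot_discount: float = 0.0) -> float:
    """Estimated training cost: instance-hours times hourly rate, optionally
    reduced by a spot-instance discount."""
    return hours * hourly_rate * (1 - spot_discount)

cpu_run  = training_cost(hours=10, hourly_rate=0.50)                  # $5.00
gpu_run  = training_cost(hours=13, hourly_rate=3.80)                  # ~$49.40
gpu_spot = training_cost(hours=13, hourly_rate=3.80, spot_discount=0.70)
print(cpu_run, round(gpu_run, 2), round(gpu_spot, 2))
```

Even a rough back-of-the-envelope calculation like this shows why hardware choice and spot capacity dominate the training bill.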

🎮 Algorithm Picker

Given a business problem, select the right algorithm. Click a scenario to see the recommended approach, reasoning, and full configuration.

💰

Income Classification

Predict whether income is above/below $50K from demographics CSV.

🛡

Real-Time Fraud Detection

Detect fraudulent transactions as they happen with no labeled fraud data.

📅

Purchase Forecasting

Predict when a customer will make their next purchase based on history.

🏷

Discount Response

Classify whether customer responds best to small, large, or no discount.

📋 Income Classification: This is supervised binary classification. You have labeled data (income known). Recommended: XGBoost (best accuracy on tabular), Linear Learner (interpretable baseline), CatBoost (handles categoricals natively).
🎯 Algorithm Recommendation
Problem type: supervised, binary classification
Primary algo: XGBoost (best tabular accuracy)
Alternative: Linear Learner (more interpretable)
Compute: CPU sufficient (tabular data)
Interpretability: SHAP values for feature importance
Est. cost: low (~$5 per training run on ml.m5.xlarge)
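For the income-classification scenario, the recommendation above translates into a small hyperparameter set. The values reuse the ones this module references (max_depth=5, eta=0.1, num_round=1000); the choice of eval metric is an illustrative assumption. SageMaker's built-in XGBoost expects hyperparameters as strings:

```python
# Hedged sketch of a hyperparameter config for SageMaker built-in XGBoost
hyperparameters = {
    "objective": "binary:logistic",  # output P(income > $50K)
    "max_depth": "5",                # shallow trees = weak learners
    "eta": "0.1",                    # learning rate weighting each tree
    "num_round": "1000",             # number of sequential trees
    "eval_metric": "auc",            # assumption: AUC as validation metric
}
# These would be passed to a SageMaker Estimator, e.g.:
# estimator = sagemaker.estimator.Estimator(..., hyperparameters=hyperparameters)
```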

📝 Module Summary

Built-in Algorithms

6+ model families covering supervised, unsupervised, text, vision, and time-series problems.

Autopilot (AutoML)

Automated algorithm selection and tuning for tabular data. Quick baselines with transparent pipelines.

Interpretability

Balance accuracy vs explainability. Use SHAP/LIME for complex models. Regulatory compliance requires it.

Cost Optimization

CPU for tabular, GPU for deep learning. Spot instances, Savings Plans, model compression reduce costs.