Module 11 - Interactive Explainer
Detect drift before it impacts payroll accuracy, automate retraining when data patterns shift, and maintain model quality across millions of workforce transactions.
ML models are trained on historical data - but the world changes. At AnyCompany, payroll patterns shift seasonally, new countries onboard, regulations change, and workforce demographics evolve. A model that was 92% accurate at deployment can silently drop to 70% within months if not monitored.
Data quality drift: the statistical properties of incoming data change from what the model was trained on - new data formats, missing fields, or shifted distributions. AnyCompany: a new country onboards with different payroll formats.
Model quality drift: prediction accuracy degrades over time even if the data looks similar, because the relationship between features and target has changed. AnyCompany: post-COVID remote work changed attrition patterns.
Bias drift: model fairness changes - predictions become more biased toward certain groups over time. AnyCompany: a hiring model starts favoring one demographic as the production data composition shifts.
Feature attribution drift: the importance of features changes - a feature that was highly predictive becomes less relevant (or vice versa). AnyCompany: "commute distance" became irrelevant when remote work expanded.
Data quality drift: India payroll data starts arriving in a new format after HRIS migration. Model receives unexpected null values in tax_filing_status field.
Model quality drift: Fraud detection recall drops from 95% to 78% because fraudsters adapted their patterns after the model was deployed.
Bias drift: Attrition model trained on pre-2020 data predicts higher risk for remote workers (who were rare in training data but now represent 40% of workforce).
Feature attribution drift: office_location was the #2 predictor of attrition. After hybrid work policy, it dropped to #15. manager_1on1_frequency rose to #1.
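To make distribution drift concrete, here is a toy sketch - not Model Monitor's internal algorithm - that compares a baseline feature distribution against live traffic with a two-sample Kolmogorov-Smirnov test. The synthetic "transaction amount" data and the size of the shift are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic baseline: transaction amounts seen at training time.
baseline = rng.lognormal(mean=7.0, sigma=0.8, size=10_000)

# Synthetic live traffic after a new country onboards with higher amounts.
live = rng.lognormal(mean=7.15, sigma=0.8, size=10_000)

# The KS statistic grows as the two distributions diverge; a small
# p-value flags a statistically significant shift worth alerting on.
statistic, p_value = stats.ks_2samp(baseline, live)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}")
```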
Model Monitor continuously watches your deployed models, compares live data against training baselines, and alerts you when drift is detected - before it impacts business outcomes.
1. Baseline: analyze training data to establish statistical baselines - distributions, ranges, correlations, and constraints. This is your "known good" reference point.
2. Compare: continuously capture inference requests and responses, compare live data statistics against the baseline, and flag deviations that exceed thresholds.
3. Report: produce violation reports, statistics, and CloudWatch metrics; trigger alarms when drift exceeds acceptable bounds; enable automated remediation.
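The capture half of phase 2 is switched on when the model is deployed. A minimal sketch using the SageMaker Python SDK - the container URI, S3 paths, and endpoint name are placeholders, not Lab 7's actual values:

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder container and artifact; in practice this is the trained fraud model.
model = Model(
    image_uri="<xgboost-container-uri>",
    model_data="s3://anycompany-ml/fraud/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Record 100% of requests/responses to S3 so the monitoring job can
# compare live traffic against the training baseline.
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://anycompany-ml/fraud/datacapture",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="fraud-endpoint",
    data_capture_config=capture_config,
)
```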
| Service | Role | AnyCompany Use |
|---|---|---|
| CloudWatch Metrics | Store and visualize monitoring metrics | Dashboard showing fraud model recall over time, data drift scores per feature |
| CloudWatch Alarms | Alert when metrics breach thresholds | Alarm if data_drift_score > 0.1 for any feature for 10+ minutes |
| EventBridge | Route events to trigger automated actions | Drift detected event triggers Lambda to start retraining pipeline |
| CloudTrail | Audit all monitoring API calls | Compliance audit: who changed monitoring thresholds, when |
| SageMaker Clarify | Bias and explainability analysis | Detect if fraud model becomes biased against specific employee demographics |
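For the CloudWatch Alarms row, a boto3 sketch of the alarm described above (drift > 0.1 sustained for 10 minutes). Model Monitor publishes per-feature drift metrics for endpoints; the namespace, metric-name pattern, and ARN below are assumptions to verify against your own monitoring output:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="fraud-endpoint-amount-drift",
    # Namespace/metric name follow the pattern Model Monitor uses for
    # endpoint data-quality metrics; adapt to your account.
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_amount",
    Dimensions=[
        {"Name": "Endpoint", "Value": "fraud-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "fraud-data-quality-hourly"},
    ],
    Statistic="Average",
    Period=600,                # one 10-minute window, per the table above
    EvaluationPeriods=1,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # Placeholder SNS topic for the MLOps on-call.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-drift-alerts"],
)
```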
| Scenario | Cadence | AnyCompany Model |
|---|---|---|
| Real-time endpoints | Hourly or daily checks | Fraud detection endpoint - check every hour during business hours |
| Batch transform jobs | After each batch run | Monthly attrition scoring - validate output after each monthly run |
| On-demand | Triggered manually or by event | Ad-hoc check after major data migration or system change |
Two complementary monitoring approaches: data quality checks whether the INPUT is healthy, model quality checks whether the OUTPUT is accurate.
| Step | Action | AnyCompany Implementation |
|---|---|---|
| 1. Data Capture | Capture inference requests/responses to S3 | Capture all fraud scoring requests (amount, employee_id, vendor, timestamp) |
| 2. Create Baseline | Generate statistics.json and constraints.json from training data | Baseline: income range $15K-$500K, age 18-70, no nulls in required fields |
| 3. Schedule Monitor | Define monitoring job frequency | Hourly for fraud endpoint, daily for attrition batch output |
| 4. CloudWatch Integration | Set alarms on drift metrics, trigger SNS notifications | Alarm if any feature drifts > 0.1 from baseline for 2+ consecutive checks |
| 5. Interpret Results | Review constraint_violations.json for specific issues | Check: data_type_check, completeness_check, baseline_drift_check, missing_column_check |
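Steps 2, 3, and 5 map onto the SageMaker Python SDK roughly as follows - a sketch assuming placeholder S3 paths and the fraud endpoint already deployed with data capture enabled (step 1):

```python
import sagemaker
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = sagemaker.get_execution_role()

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Step 2: generate statistics.json and constraints.json from training data.
monitor.suggest_baseline(
    baseline_dataset="s3://anycompany-ml/fraud/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://anycompany-ml/fraud/baseline",
    wait=True,
)

# Step 3: hourly data-quality checks against captured traffic.
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-data-quality-hourly",
    endpoint_input="fraud-endpoint",
    output_s3_uri="s3://anycompany-ml/fraud/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)

# Step 5: inspect constraint_violations.json from the latest execution.
violations = monitor.latest_monitoring_constraint_violations()
print(violations.body_dict)
```

Step 4, the CloudWatch alarm, is shown in the services section above.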
Model quality requires ground truth - the actual outcomes to compare against predictions. At AnyCompany, this means waiting to see which flagged transactions were actually fraud, or which predicted leavers actually left.
1. Collect ground truth: gather actual outcomes - was the transaction really fraud? Did the employee actually leave? Match predictions to reality using unique IDs.
2. Merge predictions with labels: join model predictions with ground truth labels, then calculate actual precision, recall, and F1 against what the model predicted weeks or months ago.
3. Track metrics over time: plot accuracy metrics over time, detect gradual degradation, and alert when metrics drop below acceptable thresholds (e.g., recall < 90%).
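SageMaker's ModelQualityMonitor automates this join-and-score loop once labels land in S3. A sketch assuming the captured output includes a prediction field, investigators upload labels to a ground-truth prefix, and a baseline dataset holds both predictions and actual outcomes (all names are placeholders):

```python
import sagemaker
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = sagemaker.get_execution_role()

quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baseline from a validation set that has both predictions and labels,
# so acceptable ranges for recall/precision/F1 can be derived.
quality_monitor.suggest_baseline(
    baseline_dataset="s3://anycompany-ml/fraud/validation_with_predictions.csv",
    dataset_format=DatasetFormat.csv(header=True),
    problem_type="BinaryClassification",
    inference_attribute="prediction",      # assumed column names
    ground_truth_attribute="is_fraud",
    output_s3_uri="s3://anycompany-ml/fraud/quality-baseline",
    wait=True,
)

# Daily job: join captured predictions with investigator labels and
# recompute the metrics against the baseline constraints.
quality_monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-model-quality-daily",
    endpoint_input=EndpointInput(
        endpoint_name="fraud-endpoint",
        destination="/opt/ml/processing/input_data",
        inference_attribute="prediction",
    ),
    ground_truth_input="s3://anycompany-ml/fraud/ground-truth/",
    problem_type="BinaryClassification",
    output_s3_uri="s3://anycompany-ml/fraud/quality-reports",
    constraints=quality_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```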
Detecting drift is only half the battle. You need automated responses that fix the problem before it impacts AnyCompany operations. Four types of automated actions:
Team notifications: SNS alerts to the ML team, Slack/Teams integration, PagerDuty for critical models. AnyCompany: alert the MLOps on-call when the fraud model drifts.
Data pipeline alerts: notify data engineers when input data quality degrades or expected data is missing. AnyCompany: alert when the India payroll feed stops arriving.
Automated retraining: automatically trigger the retraining pipeline when drift exceeds a threshold. AnyCompany: retrain the fraud model when recall drops below 90%.
Auto-scaling: scale compute when utilization metrics spike. AnyCompany: scale the fraud endpoint during the year-end payroll processing surge.
| Strategy | Trigger | Best For | AnyCompany Example |
|---|---|---|---|
| Event-Driven | Drift detected, alarm fires | Critical models where accuracy matters most | Fraud model: retrain immediately when recall drops below threshold |
| Scheduled | Calendar-based (weekly, monthly, quarterly) | Models with predictable data refresh cycles | Attrition model: retrain quarterly after performance review data arrives |
| On-Demand | Manual trigger by ML team | After major business changes or data migrations | After AnyCompany acquires new company - retrain on merged employee data |
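The event-driven path is wired end-to-end in the Lab 7 architecture below. For the scheduled strategy, a calendar-based EventBridge rule that starts the retraining workflow is enough - a boto3 sketch in which the rule name, cron expression, and ARNs are placeholder assumptions:

```python
import boto3

events = boto3.client("events")

# Quarterly trigger for the attrition model: 06:00 UTC on the first day
# of January, April, July, and October.
events.put_rule(
    Name="attrition-quarterly-retrain",
    ScheduleExpression="cron(0 6 1 1,4,7,10 ? *)",
    State="ENABLED",
)

# Point the rule at the retraining state machine (placeholder ARNs).
events.put_targets(
    Rule="attrition-quarterly-retrain",
    Targets=[{
        "Id": "retrain-state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:attrition-retrain",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```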
Model Registry: version history of all models. Compare the current production model against previous versions; roll back if the new version performs worse.
Model Cards: documentation for each model - intended use, limitations, performance metrics, bias assessments. Required for AnyCompany compliance audits.
Lineage tracking: trace any prediction back to which model version, trained on which data version, with which code version. Essential for debugging production issues.
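As a small registry sketch - assuming the fraud model is registered in a model package group named anycompany-fraud-detection - boto3 can list versions to support a rollback decision:

```python
import boto3

sm = boto3.client("sagemaker")

# Newest first: compare the production version's metrics against its
# predecessors before approving or rolling back.
response = sm.list_model_packages(
    ModelPackageGroupName="anycompany-fraud-detection",
    SortBy="CreationTime",
    SortOrder="Descending",
)
for pkg in response["ModelPackageSummaryList"]:
    print(pkg["ModelPackageVersion"], pkg["ModelApprovalStatus"])
```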
Lab 7 implements a complete closed-loop monitoring and retraining system. When drift is detected, the system automatically retrains and redeploys the model - no human intervention required for routine drift.
| Component | Role | How It Connects |
|---|---|---|
| SageMaker Endpoint | Serves predictions in production | Data Capture sends requests/responses to S3 |
| SageMaker Data Capture | Records all inference traffic | Feeds monitoring job with live data samples |
| Baseline Statistics | Reference point from training data | Monitoring job compares live data against this |
| SageMaker Monitoring Job | Scheduled comparison of live vs baseline | Outputs violations and metrics to CloudWatch |
| CloudWatch Alarm | Fires when drift exceeds threshold | Triggers SNS notification |
| Amazon SNS | Notification routing | Invokes Lambda function |
| AWS Lambda | Lightweight trigger function | Starts Step Functions state machine |
| Step Functions | Orchestrates retraining workflow | Runs: retrain model, evaluate, deploy new version |
1. Fraud model serves predictions on payroll transactions (SageMaker Endpoint)
2. Data Capture records every request to S3
3. Hourly monitoring job detects: transaction amounts have shifted 15% higher than baseline (new country onboarded)
4. CloudWatch alarm fires: data_drift_score > 0.1
5. SNS notifies Lambda, Lambda starts Step Functions
6. Step Functions: pulls latest data, retrains XGBoost, evaluates (AUC still > 0.92), deploys new model version with linear traffic shifting
7. New model serves predictions adapted to the new data distribution. Zero human intervention.
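The Lambda in steps 5-6 is deliberately thin: parse the alarm out of the SNS event and start the state machine. A sketch, assuming the workflow's ARN is supplied through a hypothetical STATE_MACHINE_ARN environment variable:

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """Invoked by SNS when the drift alarm fires; starts retraining."""
    # SNS wraps the CloudWatch alarm payload as a JSON string.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps({
            "alarmName": alarm.get("AlarmName"),
            "reason": alarm.get("NewStateReason"),
        }),
    )
```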
Select an AnyCompany model to see its monitoring configuration - what drift types to watch, thresholds, cadence, and automated response.
Fraud detection: real-time endpoint. Transaction patterns shift with new clients and seasonal payroll cycles.
Attrition prediction: batch scoring. Workforce demographics and work patterns evolve over quarters.
Conversational AI: user query patterns and HR policies change continuously.
| Monitoring Aspect | Fraud Detection Model Configuration |
|---|---|
| Drift Types Monitored | Data quality (hourly), Model quality (daily with ground truth), Bias (weekly on protected attributes) |
| Key Metrics | feature_baseline_drift per feature, recall, precision, false_negative_rate |
| Thresholds | Data drift > 0.1 (alarm), Recall < 90% (critical alarm), Bias disparity > 5% (alert) |
| Cadence | Data quality: hourly. Model quality: daily (after fraud investigation labels arrive). Bias: weekly. |
| Auto-Remediation | CloudWatch alarm → SNS → Lambda → Step Functions retraining pipeline. Linear deploy if AUC > 0.92. |
| Manual Intervention | Required if: retraining fails quality gate, bias detected, or regulatory change affects model logic. |
Drift types: data quality, model quality, bias, and feature attribution. All degrade models silently over time.
Model Monitor: baseline, compare, report. Integrates with CloudWatch, EventBridge, Clarify. Hourly to on-demand cadence.
Implementation: a 5-step process - capture, baseline, schedule, alarm, interpret. Ground truth required for model quality.
Automation: closed-loop Monitor → CloudWatch → SNS → Lambda → Step Functions → Retrain → Deploy. Zero human intervention for routine drift.