Module 11 - Interactive Explainer

Monitoring Model Performance

Detect drift before it impacts payroll accuracy, automate retraining when data patterns shift, and maintain model quality across millions of workforce transactions.

📊 Monitoring ⚡ Interactive 🏢 HCM Context 🧪 Lab 7

📉 Why Models Degrade Over Time

ML models are trained on historical data - but the world changes. At AnyCompany, payroll patterns shift seasonally, new countries onboard, regulations change, and workforce demographics evolve. A model that was 92% accurate at deployment can silently drop to 70% within months if not monitored.

Four Types of Drift

📊

Data Quality Drift

The statistical properties of incoming data change from what the model was trained on: new data formats, missing fields, or shifted distributions. AnyCompany: a new country onboards with different payroll formats.

📈

Model Quality Drift

Prediction accuracy degrades over time even if data looks similar. The relationship between features and target has changed. AnyCompany: post-COVID remote work changed attrition patterns.

⚖️

Bias Drift

Model fairness changes - predictions become more biased toward certain groups over time. AnyCompany: hiring model starts favoring one demographic as production data composition shifts.

🔀

Feature Attribution Drift

The importance of features changes. A feature that was highly predictive becomes less relevant (or vice versa). AnyCompany: "commute distance" became irrelevant when remote work expanded.
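Distribution shifts like these can be quantified with simple two-sample statistics before you ever touch managed tooling. The sketch below is purely illustrative (SageMaker Model Monitor, covered later in this module, computes its own per-feature statistics); the synthetic payroll amounts and the 0.1 threshold are assumptions for the example.

```python
# Illustrative drift check: compare live feature values against the training
# distribution with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline_amounts = rng.lognormal(mean=8.0, sigma=0.5, size=10_000)  # training-time amounts
live_amounts = rng.lognormal(mean=8.15, sigma=0.5, size=2_000)      # live traffic, ~15% higher

result = ks_2samp(baseline_amounts, live_amounts)
if result.statistic > 0.1:  # same style of threshold used later in this module
    print(f"Drift detected: KS statistic {result.statistic:.3f} (p={result.pvalue:.1e})")
```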

๐Ÿข Drift at AnyCompany Scale

Real-World Drift Scenarios

Data quality drift: India payroll data starts arriving in a new format after HRIS migration. Model receives unexpected null values in tax_filing_status field.

Model quality drift: Fraud detection recall drops from 95% to 78% because fraudsters adapted their patterns after the model was deployed.

Bias drift: Attrition model trained on pre-2020 data predicts higher risk for remote workers (who were rare in training data but now represent 40% of workforce).

Feature attribution drift: office_location was the #2 predictor of attrition. After the hybrid work policy, it dropped to #15 and manager_1on1_frequency rose to #1.

⚠️
Silent degradation is the biggest risk. Unlike a crashed server (immediately visible), a drifted model keeps serving predictions - they are just increasingly wrong. Without monitoring, AnyCompany could process months of payroll with a degraded fraud model before anyone notices.

๐Ÿ” Amazon SageMaker Model Monitor

Model Monitor continuously watches your deployed models, compares live data against training baselines, and alerts you when drift is detected - before it impacts business outcomes.

How It Works: Three Steps

📏

1. Compute Baseline

Analyze training data to establish statistical baselines: distributions, ranges, correlations, and constraints. This is your "known good" reference point.

🔄

2. Compare Incoming Data

Continuously capture inference requests/responses. Compare live data statistics against baseline. Flag deviations that exceed thresholds.

📋

3. Generate Reports

Produce violation reports, statistics, and CloudWatch metrics. Trigger alarms when drift exceeds acceptable bounds. Enable automated remediation.
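Step 1 maps to a single SDK call. A minimal sketch with the SageMaker Python SDK, assuming a CSV training set; the IAM role ARN, bucket, and paths are placeholders.

```python
# Compute a data-quality baseline (statistics.json + constraints.json)
# from the training data.
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/AnyCompanyMLOpsRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

monitor.suggest_baseline(
    baseline_dataset="s3://anycompany-ml/fraud/train/train.csv",  # placeholder path
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://anycompany-ml/fraud/baseline",
)
```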

🔗 Integration with AWS Services

| Service | Role | AnyCompany Use |
|---|---|---|
| CloudWatch Metrics | Store and visualize monitoring metrics | Dashboard showing fraud model recall over time, data drift scores per feature |
| CloudWatch Alarms | Alert when metrics breach thresholds | Alarm if data_drift_score > 0.1 for any feature for 10+ minutes |
| EventBridge | Route events to trigger automated actions | Drift detected event triggers Lambda to start retraining pipeline |
| CloudTrail | Audit all monitoring API calls | Compliance audit: who changed monitoring thresholds, when |
| SageMaker Clarify | Bias and explainability analysis | Detect if fraud model becomes biased against specific employee demographics |
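The CloudWatch Alarms row maps to a single boto3 call. A hedged sketch: the namespace and metric name below follow the documented pattern for data-quality monitoring (Model Monitor emits feature_baseline_drift_<feature> metrics), and the endpoint, schedule, and SNS topic names are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if drift on the `amount` feature exceeds 0.1 on two consecutive
# hourly checks, then notify the ML alerts topic.
cloudwatch.put_metric_alarm(
    AlarmName="fraud-endpoint-amount-drift",
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_amount",
    Dimensions=[
        {"Name": "Endpoint", "Value": "fraud-detection-endpoint"},            # placeholder
        {"Name": "MonitoringSchedule", "Value": "fraud-data-quality-hourly"}, # placeholder
    ],
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=2,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-drift-alerts"],      # placeholder
)
```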

Monitoring Scenarios & Cadence

| Scenario | Cadence | AnyCompany Model |
|---|---|---|
| Real-time endpoints | Hourly or daily checks | Fraud detection endpoint - check every hour during business hours |
| Batch transform jobs | After each batch run | Monthly attrition scoring - validate output after each monthly run |
| On-demand | Triggered manually or by event | Ad-hoc check after major data migration or system change |

✅ Data Quality & Model Quality Monitoring

Two complementary approaches: data quality monitoring checks whether the input is healthy; model quality monitoring checks whether the output is accurate.

5-Step Data Quality Monitoring

| Step | Action | AnyCompany Implementation |
|---|---|---|
| 1. Data Capture | Capture inference requests/responses to S3 | Capture all fraud scoring requests (amount, employee_id, vendor, timestamp) |
| 2. Create Baseline | Generate statistics.json and constraints.json from training data | Baseline: income range $15K-$500K, age 18-70, no nulls in required fields |
| 3. Schedule Monitor | Define monitoring job frequency | Hourly for fraud endpoint, daily for attrition batch output |
| 4. CloudWatch Integration | Set alarms on drift metrics, trigger SNS notifications | Alarm if any feature drifts > 0.1 from baseline for 2+ consecutive checks |
| 5. Interpret Results | Review constraint_violations.json for specific issues | Check: data_type_check, completeness_check, baseline_drift_check, missing_column_check |
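Steps 1 and 3 map to a few SDK calls. A sketch continuing from the baseline snippet above (the monitor object comes from there; the endpoint, bucket, and schedule names remain placeholders).

```python
from sagemaker.model_monitor import CronExpressionGenerator, DataCaptureConfig

# Step 1: enable capture when deploying the endpoint; pass this object to
# model.deploy(..., data_capture_config=capture_config).
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,  # capture every fraud-scoring request
    destination_s3_uri="s3://anycompany-ml/fraud/capture",
)

# Step 3: compare live traffic against the baseline every hour.
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-data-quality-hourly",
    endpoint_input="fraud-detection-endpoint",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    output_s3_uri="s3://anycompany-ml/fraud/monitoring-reports",
)
```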

🎯 Model Quality Monitoring

Model quality requires ground truth - the actual outcomes to compare against predictions. At AnyCompany, this means waiting to see which flagged transactions were actually fraud, or which predicted leavers actually left.

📊

Collect Ground Truth

Gather actual outcomes: was the transaction really fraud? Did the employee actually leave? Match predictions to reality using unique IDs.

🔗

Merge Predictions + Truth

Join model predictions with ground truth labels. Calculate actual precision, recall, F1 against what the model predicted weeks/months ago.

📉

Track Accuracy Over Time

Plot accuracy metrics over time. Detect gradual degradation. Alert when metrics drop below acceptable thresholds (e.g., recall < 90%).

💡
Ground truth delay: For fraud detection, you know within days (chargebacks, investigations). For attrition prediction, you may wait 90 days to confirm if the employee actually left. Design monitoring cadence around your ground truth availability.
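As a simplified offline illustration of the merge-and-track steps (SageMaker's model quality monitoring runs a comparable merge job for you), here is a pandas sketch. The file names, column names, and 90% threshold are assumptions.

```python
# Join logged predictions with ground-truth labels and track weekly recall.
import pandas as pd
from sklearn.metrics import recall_score

preds = pd.read_csv("predictions.csv")    # columns: txn_id, predicted_fraud, prediction_date
truth = pd.read_csv("ground_truth.csv")   # columns: txn_id, actual_fraud (from investigations)

merged = preds.merge(truth, on="txn_id", how="inner")
merged["week"] = pd.to_datetime(merged["prediction_date"]).dt.to_period("W")

# Did the model catch the transactions that turned out to be fraud?
weekly_recall = merged.groupby("week").apply(
    lambda g: recall_score(g["actual_fraud"], g["predicted_fraud"])
)
print(weekly_recall[weekly_recall < 0.90])  # weeks breaching the 90% recall threshold
```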

🔄 Automated Remediation

Detecting drift is only half the battle. You need automated responses that fix the problem before it impacts AnyCompany operations. Four types of automated actions:

📧

Stakeholder Notifications

SNS alerts to ML team, Slack/Teams integration, PagerDuty for critical models. AnyCompany: alert MLOps on-call when fraud model drifts.

📊

Data Analysis Alerts

Notify data engineers when input data quality degrades or expected data is missing. AnyCompany: alert when India payroll feed stops arriving.

🔄

Model Retraining

Automatically trigger retraining pipeline when drift exceeds threshold. AnyCompany: retrain fraud model when recall drops below 90%.

📈

Auto-Scaling

Scale compute when utilization metrics spike. AnyCompany: scale fraud endpoint during year-end payroll processing surge.
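The auto-scaling action uses Application Auto Scaling's target-tracking support for SageMaker endpoints. A hedged boto3 sketch; the endpoint and variant names, capacity bounds, and target value are placeholders.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/fraud-detection-endpoint/variant/AllTraffic"  # placeholder

# Allow the fraud endpoint to scale between 2 and 10 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Track invocations per instance; scales out during the year-end surge.
autoscaling.put_scaling_policy(
    PolicyName="fraud-endpoint-invocation-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # placeholder invocations-per-instance target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```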

๐Ÿ” Retraining Strategies

| Strategy | Trigger | Best For | AnyCompany Example |
|---|---|---|---|
| Event-Driven | Drift detected, alarm fires | Critical models where accuracy matters most | Fraud model: retrain immediately when recall drops below threshold |
| Scheduled | Calendar-based (weekly, monthly, quarterly) | Models with predictable data refresh cycles | Attrition model: retrain quarterly after performance review data arrives |
| On-Demand | Manual trigger by ML team | After major business changes or data migrations | After AnyCompany acquires a new company - retrain on merged employee data |
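The Scheduled strategy needs no drift signal at all: an EventBridge rule can start the retraining workflow on a calendar. A sketch with boto3; the rule name, cron expression, and ARNs are placeholders.

```python
import boto3

events = boto3.client("events")

# Retrain the attrition model quarterly, after performance review data lands.
events.put_rule(
    Name="attrition-quarterly-retrain",
    ScheduleExpression="cron(0 6 1 1,4,7,10 ? *)",  # 06:00 UTC, Jan/Apr/Jul/Oct 1
)
events.put_targets(
    Rule="attrition-quarterly-retrain",
    Targets=[{
        "Id": "retrain-pipeline",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:attrition-retrain",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```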

🔧 Troubleshooting Tools

📋

Model Registry

Version history of all models. Compare current production model against previous versions. Roll back if new version performs worse.

🪪

Model Cards

Documentation for each model: intended use, limitations, performance metrics, bias assessments. Required for AnyCompany compliance audits.

🔗

Lineage Tracking

Trace any prediction back to: which model version, trained on which data version, with which code version. Essential for debugging production issues.

๐Ÿ—๏ธ Lab 7: Auto-Retraining Architecture

Lab 7 implements a complete closed-loop monitoring and retraining system. When drift is detected, the system automatically retrains and redeploys the model - no human intervention required for routine drift.

Architecture Components

| Component | Role | How It Connects |
|---|---|---|
| SageMaker Endpoint | Serves predictions in production | Data Capture sends requests/responses to S3 |
| SageMaker Data Capture | Records all inference traffic | Feeds monitoring job with live data samples |
| Baseline Statistics | Reference point from training data | Monitoring job compares live data against this |
| SageMaker Monitoring Job | Scheduled comparison of live vs. baseline | Outputs violations and metrics to CloudWatch |
| CloudWatch Alarm | Fires when drift exceeds threshold | Triggers SNS notification |
| Amazon SNS | Notification routing | Invokes Lambda function |
| AWS Lambda | Lightweight trigger function | Starts Step Functions state machine |
| Step Functions | Orchestrates retraining workflow | Runs: retrain model, evaluate, deploy new version |

The Closed Loop at AnyCompany

1. Fraud model serves predictions on payroll transactions (SageMaker Endpoint)

2. Data Capture records every request to S3

3. Hourly monitoring job detects: transaction amounts have shifted 15% higher than baseline (new country onboarded)

4. CloudWatch alarm fires: data_drift_score > 0.1

5. SNS notifies Lambda, Lambda starts Step Functions

6. Step Functions: pulls latest data, retrains XGBoost, evaluates (AUC still > 0.92), deploys new model version with linear traffic shifting

7. New model serves predictions adapted to the new data distribution. Zero human intervention.
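Steps 5 and 6 hinge on a small piece of glue code. A minimal sketch of that Lambda handler, assuming the state machine ARN arrives via an environment variable (the variable name and input payload shape are assumptions).

```python
# Lambda: triggered by SNS, starts the Step Functions retraining workflow.
import json
import os
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # The SNS message carries the CloudWatch alarm payload describing the drift.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    sfn.start_execution(
        stateMachineArn=os.environ["RETRAIN_STATE_MACHINE_ARN"],
        input=json.dumps({"alarm_name": alarm.get("AlarmName"), "trigger": "drift"}),
    )
```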

🎮 Drift Detector

Select an AnyCompany model to see its monitoring configuration - what drift types to watch, thresholds, cadence, and automated response.

🛡️

Payroll Fraud Detection

Real-time endpoint. Transaction patterns shift with new clients and seasonal payroll cycles.

👤

Employee Attrition

Batch scoring. Workforce demographics and work patterns evolve over quarters.

💬

AnyCompany Assist (LLM)

Conversational AI. User query patterns and HR policies change continuously.

📋 Payroll Fraud Detection: Critical real-time model. Monitor hourly for data quality drift (new transaction patterns from new clients). Event-driven retraining when recall drops. Full auto-remediation pipeline: detect, retrain, evaluate, deploy - all automated.

| Monitoring Aspect | Configuration |
|---|---|
| Drift Types Monitored | Data quality (hourly), model quality (daily with ground truth), bias (weekly on protected attributes) |
| Key Metrics | feature_baseline_drift per feature, recall, precision, false_negative_rate |
| Thresholds | Data drift > 0.1 (alarm), recall < 90% (critical alarm), bias disparity > 5% (alert) |
| Cadence | Data quality: hourly. Model quality: daily (after fraud investigation labels arrive). Bias: weekly. |
| Auto-Remediation | CloudWatch alarm → SNS → Lambda → Step Functions retraining pipeline. Linear deploy if AUC > 0.92. |
| Manual Intervention | Required if retraining fails the quality gate, bias is detected, or a regulatory change affects model logic. |

๐Ÿ“ Module Summary

โœ…

Types of Drift

Data quality, model quality, bias, and feature attribution drift. All degrade models silently over time.

โœ…

SageMaker Model Monitor

Baseline, compare, report. Integrates with CloudWatch, EventBridge, Clarify. Hourly to on-demand cadence.

โœ…

Quality Monitoring

5-step process: capture, baseline, schedule, alarm, interpret. Ground truth required for model quality.

โœ…

Auto-Remediation

Closed-loop: Monitor → CloudWatch → SNS → Lambda → Step Functions → Retrain → Deploy. Zero human intervention for routine drift.