Module 11 - Interactive Explainer
Detect drift before it impacts payroll accuracy, automate retraining when data patterns shift, and maintain model quality across millions of workforce transactions.
ML models are trained on historical data - but the world changes. At AnyCompany, payroll patterns shift seasonally, new countries onboard, regulations change, and workforce demographics evolve. A model that was 92% accurate at deployment can silently drop to 70% within months if not monitored.
Data quality drift: the statistical properties of incoming data change from what the model was trained on - new data formats, missing fields, or shifted distributions. AnyCompany: a new country onboards with different payroll formats.
Model quality drift: prediction accuracy degrades over time even if the data looks similar, because the relationship between features and target has changed. AnyCompany: post-COVID remote work changed attrition patterns.
Bias drift: model fairness changes - predictions become more biased toward certain groups over time. AnyCompany: a hiring model starts favoring one demographic as the production data composition shifts.
Feature attribution drift: the importance of features changes - a feature that was highly predictive becomes less relevant (or vice versa). AnyCompany: "commute distance" became irrelevant when remote work expanded.
Data quality drift: India payroll data starts arriving in a new format after HRIS migration. Model receives unexpected null values in tax_filing_status field.
Model quality drift: Fraud detection recall drops from 95% to 78% because fraudsters adapted their patterns after the model was deployed.
Bias drift: Attrition model trained on pre-2020 data predicts higher risk for remote workers (who were rare in training data but now represent 40% of workforce).
Feature attribution drift: office_location was the #2 predictor of attrition. After hybrid work policy, it dropped to #15. manager_1on1_frequency rose to #1.
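To make distribution drift concrete, here is a toy sketch - not Model Monitor's internal algorithm - that compares a baseline feature distribution against live traffic with a two-sample Kolmogorov-Smirnov test. The synthetic "transaction amount" data and the size of the shift are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic baseline: transaction amounts seen at training time.
baseline = rng.lognormal(mean=7.0, sigma=0.8, size=10_000)

# Synthetic live traffic after a new country onboards with higher amounts.
live = rng.lognormal(mean=7.15, sigma=0.8, size=10_000)

# The KS statistic grows as the two distributions diverge; a small
# p-value flags a statistically significant shift worth alerting on.
statistic, p_value = stats.ks_2samp(baseline, live)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}")
```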
Model Monitor continuously watches your deployed models, compares live data against training baselines, and alerts you when drift is detected - before it impacts business outcomes.
1. Baseline: analyze training data to establish statistical baselines - distributions, ranges, correlations, and constraints. This is your "known good" reference point.
2. Compare: continuously capture inference requests and responses, compare live data statistics against the baseline, and flag deviations that exceed thresholds.
3. Report: produce violation reports, statistics, and CloudWatch metrics; trigger alarms when drift exceeds acceptable bounds; enable automated remediation.
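The capture half of phase 2 is switched on when the model is deployed. A minimal sketch using the SageMaker Python SDK - the container URI, S3 paths, and endpoint name are placeholders, not Lab 7's actual values:

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder container and artifact; in practice this is the trained fraud model.
model = Model(
    image_uri="<xgboost-container-uri>",
    model_data="s3://anycompany-ml/fraud/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Record 100% of requests/responses to S3 so the monitoring job can
# compare live traffic against the training baseline.
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://anycompany-ml/fraud/datacapture",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="fraud-endpoint",
    data_capture_config=capture_config,
)
```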
| Service | Role | AnyCompany Use |
|---|---|---|
| CloudWatch Metrics | Store and visualize monitoring metrics | Dashboard showing fraud model recall over time, data drift scores per feature |
| CloudWatch Alarms | Alert when metrics breach thresholds | Alarm if data_drift_score > 0.1 for any feature for 10+ minutes |
| EventBridge | Route events to trigger automated actions | Drift detected event triggers Lambda to start retraining pipeline |
| CloudTrail | Audit all monitoring API calls | Compliance audit: who changed monitoring thresholds, when |
| SageMaker Clarify | Bias and explainability analysis | Detect if fraud model becomes biased against specific employee demographics |
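For the CloudWatch Alarms row, a boto3 sketch of the alarm described above (drift > 0.1 sustained for 10 minutes). Model Monitor publishes per-feature drift metrics for endpoints; the namespace, metric-name pattern, and ARN below are assumptions to verify against your own monitoring output:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="fraud-endpoint-amount-drift",
    # Namespace/metric name follow the pattern Model Monitor uses for
    # endpoint data-quality metrics; adapt to your account.
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_amount",
    Dimensions=[
        {"Name": "Endpoint", "Value": "fraud-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "fraud-data-quality-hourly"},
    ],
    Statistic="Average",
    Period=600,                # one 10-minute window, per the table above
    EvaluationPeriods=1,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # Placeholder SNS topic for the MLOps on-call.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-drift-alerts"],
)
```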
| Scenario | Cadence | AnyCompany Model |
|---|---|---|
| Real-time endpoints | Hourly or daily checks | Fraud detection endpoint - check every hour during business hours |
| Batch transform jobs | After each batch run | Monthly attrition scoring - validate output after each monthly run |
| On-demand | Triggered manually or by event | Ad-hoc check after major data migration or system change |
Two complementary monitoring approaches: data quality checks whether the INPUT is healthy, model quality checks whether the OUTPUT is accurate.
| Step | Action | AnyCompany Implementation |
|---|---|---|
| 1. Data Capture | Capture inference requests/responses to S3 | Capture all fraud scoring requests (amount, employee_id, vendor, timestamp) |
| 2. Create Baseline | Generate statistics.json and constraints.json from training data | Baseline: income range $15K-$500K, age 18-70, no nulls in required fields |
| 3. Schedule Monitor | Define monitoring job frequency | Hourly for fraud endpoint, daily for attrition batch output |
| 4. CloudWatch Integration | Set alarms on drift metrics, trigger SNS notifications | Alarm if any feature drifts > 0.1 from baseline for 2+ consecutive checks |
| 5. Interpret Results | Review constraint_violations.json for specific issues | Check: data_type_check, completeness_check, baseline_drift_check, missing_column_check |
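Steps 2, 3, and 5 map onto the SageMaker Python SDK roughly as follows - a sketch assuming placeholder S3 paths and the fraud endpoint already deployed with data capture enabled (step 1):

```python
import sagemaker
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = sagemaker.get_execution_role()

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Step 2: generate statistics.json and constraints.json from training data.
monitor.suggest_baseline(
    baseline_dataset="s3://anycompany-ml/fraud/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://anycompany-ml/fraud/baseline",
    wait=True,
)

# Step 3: hourly data-quality checks against captured traffic.
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-data-quality-hourly",
    endpoint_input="fraud-endpoint",
    output_s3_uri="s3://anycompany-ml/fraud/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)

# Step 5: inspect constraint_violations.json from the latest execution.
violations = monitor.latest_monitoring_constraint_violations()
print(violations.body_dict)
```

Step 4, the CloudWatch alarm, is shown in the services section above.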
Model quality requires ground truth - the actual outcomes to compare against predictions. At AnyCompany, this means waiting to see which flagged transactions were actually fraud, or which predicted leavers actually left.
1. Collect ground truth: gather actual outcomes - was the transaction really fraud? Did the employee actually leave? Match predictions to reality using unique IDs.
2. Merge predictions with labels: join model predictions with ground truth labels, then calculate actual precision, recall, and F1 against what the model predicted weeks or months ago.
3. Track metrics over time: plot accuracy metrics over time, detect gradual degradation, and alert when metrics drop below acceptable thresholds (e.g., recall < 90%).
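SageMaker's ModelQualityMonitor automates this join-and-score loop once labels land in S3. A sketch assuming the captured output includes a prediction field, investigators upload labels to a ground-truth prefix, and a baseline dataset holds both predictions and actual outcomes (all names are placeholders):

```python
import sagemaker
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = sagemaker.get_execution_role()

quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baseline from a validation set that has both predictions and labels,
# so acceptable ranges for recall/precision/F1 can be derived.
quality_monitor.suggest_baseline(
    baseline_dataset="s3://anycompany-ml/fraud/validation_with_predictions.csv",
    dataset_format=DatasetFormat.csv(header=True),
    problem_type="BinaryClassification",
    inference_attribute="prediction",      # assumed column names
    ground_truth_attribute="is_fraud",
    output_s3_uri="s3://anycompany-ml/fraud/quality-baseline",
    wait=True,
)

# Daily job: join captured predictions with investigator labels and
# recompute the metrics against the baseline constraints.
quality_monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-model-quality-daily",
    endpoint_input=EndpointInput(
        endpoint_name="fraud-endpoint",
        destination="/opt/ml/processing/input_data",
        inference_attribute="prediction",
    ),
    ground_truth_input="s3://anycompany-ml/fraud/ground-truth/",
    problem_type="BinaryClassification",
    output_s3_uri="s3://anycompany-ml/fraud/quality-reports",
    constraints=quality_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```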
Detecting drift is only half the battle. You need automated responses that fix the problem before it impacts AnyCompany operations. Four types of automated actions:
Team notifications: SNS alerts to the ML team, Slack/Teams integration, PagerDuty for critical models. AnyCompany: alert the MLOps on-call when the fraud model drifts.
Data pipeline alerts: notify data engineers when input data quality degrades or expected data is missing. AnyCompany: alert when the India payroll feed stops arriving.
Automated retraining: automatically trigger the retraining pipeline when drift exceeds a threshold. AnyCompany: retrain the fraud model when recall drops below 90%.
Auto-scaling: scale compute when utilization metrics spike. AnyCompany: scale the fraud endpoint during the year-end payroll processing surge.
| Strategy | Trigger | Best For | AnyCompany Example |
|---|---|---|---|
| Event-Driven | Drift detected, alarm fires | Critical models where accuracy matters most | Fraud model: retrain immediately when recall drops below threshold |
| Scheduled | Calendar-based (weekly, monthly, quarterly) | Models with predictable data refresh cycles | Attrition model: retrain quarterly after performance review data arrives |
| On-Demand | Manual trigger by ML team | After major business changes or data migrations | After AnyCompany acquires new company - retrain on merged employee data |
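The event-driven path is wired end-to-end in the Lab 7 architecture below. For the scheduled strategy, a calendar-based EventBridge rule that starts the retraining workflow is enough - a boto3 sketch in which the rule name, cron expression, and ARNs are placeholder assumptions:

```python
import boto3

events = boto3.client("events")

# Quarterly trigger for the attrition model: 06:00 UTC on the first day
# of January, April, July, and October.
events.put_rule(
    Name="attrition-quarterly-retrain",
    ScheduleExpression="cron(0 6 1 1,4,7,10 ? *)",
    State="ENABLED",
)

# Point the rule at the retraining state machine (placeholder ARNs).
events.put_targets(
    Rule="attrition-quarterly-retrain",
    Targets=[{
        "Id": "retrain-state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:attrition-retrain",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```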
Model Registry: version history of all models. Compare the current production model against previous versions; roll back if the new version performs worse.
Model Cards: documentation for each model - intended use, limitations, performance metrics, bias assessments. Required for AnyCompany compliance audits.
Lineage tracking: trace any prediction back to which model version, trained on which data version, with which code version. Essential for debugging production issues.
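As a small registry sketch - assuming the fraud model is registered in a model package group named anycompany-fraud-detection - boto3 can list versions to support a rollback decision:

```python
import boto3

sm = boto3.client("sagemaker")

# Newest first: compare the production version's metrics against its
# predecessors before approving or rolling back.
response = sm.list_model_packages(
    ModelPackageGroupName="anycompany-fraud-detection",
    SortBy="CreationTime",
    SortOrder="Descending",
)
for pkg in response["ModelPackageSummaryList"]:
    print(pkg["ModelPackageVersion"], pkg["ModelApprovalStatus"])
```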
Lab 7 implements a complete closed-loop monitoring and retraining system. When drift is detected, the system automatically retrains and redeploys the model - no human intervention required for routine drift.
| Component | Role | How It Connects |
|---|---|---|
| SageMaker Endpoint | Serves predictions in production | Data Capture sends requests/responses to S3 |
| SageMaker Data Capture | Records all inference traffic | Feeds monitoring job with live data samples |
| Baseline Statistics | Reference point from training data | Monitoring job compares live data against this |
| SageMaker Monitoring Job | Scheduled comparison of live vs baseline | Outputs violations and metrics to CloudWatch |
| CloudWatch Alarm | Fires when drift exceeds threshold | Triggers SNS notification |
| Amazon SNS | Notification routing | Invokes Lambda function |
| AWS Lambda | Lightweight trigger function | Starts Step Functions state machine |
| Step Functions | Orchestrates retraining workflow | Runs: retrain model, evaluate, deploy new version |
1. Fraud model serves predictions on payroll transactions (SageMaker Endpoint)
2. Data Capture records every request to S3
3. Hourly monitoring job detects: transaction amounts have shifted 15% higher than baseline (new country onboarded)
4. CloudWatch alarm fires: data_drift_score > 0.1
5. SNS notifies Lambda, Lambda starts Step Functions
6. Step Functions: pulls latest data, retrains XGBoost, evaluates (AUC still > 0.92), deploys new model version with linear traffic shifting
7. New model serves predictions adapted to the new data distribution. Zero human intervention.
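The Lambda in steps 5-6 is deliberately thin: parse the alarm out of the SNS event and start the state machine. A sketch, assuming the workflow's ARN is supplied through a hypothetical STATE_MACHINE_ARN environment variable:

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """Invoked by SNS when the drift alarm fires; starts retraining."""
    # SNS wraps the CloudWatch alarm payload as a JSON string.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps({
            "alarmName": alarm.get("AlarmName"),
            "reason": alarm.get("NewStateReason"),
        }),
    )
```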
Select an AnyCompany model to see its monitoring configuration - what drift types to watch, thresholds, cadence, and automated response.
Fraud detection: real-time endpoint. Transaction patterns shift with new clients and seasonal payroll cycles.
Attrition prediction: batch scoring. Workforce demographics and work patterns evolve over quarters.
Conversational AI: user query patterns and HR policies change continuously.
| Monitoring Aspect | Fraud Detection Model Configuration |
|---|---|
| Drift Types Monitored | Data quality (hourly), Model quality (daily with ground truth), Bias (weekly on protected attributes) |
| Key Metrics | feature_baseline_drift per feature, recall, precision, false_negative_rate |
| Thresholds | Data drift > 0.1 (alarm), Recall < 90% (critical alarm), Bias disparity > 5% (alert) |
| Cadence | Data quality: hourly. Model quality: daily (after fraud investigation labels arrive). Bias: weekly. |
| Auto-Remediation | CloudWatch alarm → SNS → Lambda → Step Functions retraining pipeline. Linear deploy if AUC > 0.92. |
| Manual Intervention | Required if: retraining fails quality gate, bias detected, or regulatory change affects model logic. |
Drift types: data quality, model quality, bias, and feature attribution. All degrade models silently over time.
Model Monitor: baseline, compare, report. Integrates with CloudWatch, EventBridge, Clarify. Hourly to on-demand cadence.
Implementation: a 5-step process - capture, baseline, schedule, alarm, interpret. Ground truth required for model quality.
Automation: closed-loop Monitor → CloudWatch → SNS → Lambda → Step Functions → Retrain → Deploy. Zero human intervention for routine drift.