Module 8 - Interactive Explainer
From endpoint configuration to traffic shifting to inference optimization - deploy ML models safely at enterprise scale serving millions of workforce transactions.
Once your model is trained and evaluated, it needs a production home. AWS offers four deployment targets - each optimized for different operational requirements. At AnyCompany, the choice depends on latency needs, team expertise, and scale.
Amazon SageMaker: fully managed, minimal ops overhead. Built-in monitoring, auto-scaling, A/B testing. The best default choice for most AnyCompany ML models.
Amazon EKS: managed Kubernetes. Advanced custom configurations, multi-framework serving. For teams already running microservices on K8s (AnyCompany platform teams).
Amazon ECS: managed container orchestration without Kubernetes complexity. A good middle ground when you need container control but not K8s features.
AWS Lambda: serverless, pay-per-invocation. Lightweight models with intermittent traffic. AnyCompany: simple scoring functions triggered by events.
Production data drifts from training data. Payroll patterns change seasonally, new countries onboard, regulations shift. Models degrade silently.
AnyCompany processes payroll for millions of workers. Year-end and tax season create 10x traffic spikes. Endpoints must auto-scale without latency degradation.
Models need regular retraining as data evolves. Deploying new versions without downtime or regression requires sophisticated deployment strategies.
ML deployment is not just model serving - it includes monitoring, logging, rollback, security, and integration with existing AnyCompany microservices.
Deploying a new model version to production is risky. Traffic shifting strategies let you gradually move users to the new version while monitoring for issues - with automatic rollback if something goes wrong.
Run two identical environments. Shift traffic from the old (blue) to the new (green) version using one of three modes:
| Mode | How It Works | Risk Level | AnyCompany Use Case |
|---|---|---|---|
| All-at-Once | 100% of traffic flips instantly to the new version. Bake, then clean up the old fleet. | Higher | Low-risk model updates (retrained on same features, minor accuracy bump) |
| Canary | Send 10-25% to new version first. If alarms OK, flip remaining traffic. | Medium | New attrition model version - test on subset of predictions before full rollout |
| Linear | Gradually shift in steps (25% → 50% → 75% → 100%) with baking periods between. | Lowest | Payroll fraud model update - critical system, minimize blast radius at every step |
Rolling deployment: replace instances one at a time, terminating old instances as new ones come online. Simpler than blue/green, but there is no instant rollback.
Payroll Fraud Model: Linear (25% steps, 1-hour bake). Critical system - any regression could miss real fraud. Maximum safety.
Attrition Prediction: Canary (10% canary, 30-min bake). Important but not real-time critical. Moderate caution.
Learning Recommendations: All-at-once. Low-stakes model. If recommendations are slightly worse for an hour, no business impact.
AnyCompany Assist (LLM): Canary (5% canary, 2-hour bake). User-facing, but can tolerate brief quality dip on small traffic slice.
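The strategies above map to SageMaker's `DeploymentConfig` on `update_endpoint`. A minimal sketch of building the linear-shift configuration used for the fraud model (25% steps, 1-hour bake, alarm-based rollback); the endpoint, config, and alarm names are hypothetical placeholders:

```python
def linear_deployment_config(step_percent=25, bake_seconds=3600, alarm_names=()):
    """Build the DeploymentConfig dict accepted by SageMaker's update_endpoint."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                # shift this share of capacity per step
                "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": step_percent},
                "WaitIntervalInSeconds": bake_seconds,  # bake time between steps
            },
            "TerminationWaitInSeconds": 600,  # keep the blue fleet briefly for rollback
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": name} for name in alarm_names]
        },
    }

config = linear_deployment_config(alarm_names=["fraud-model-error-rate"])
# import boto3
# boto3.client("sagemaker").update_endpoint(
#     EndpointName="payroll-fraud-endpoint",   # hypothetical
#     EndpointConfigName="fraud-config-v2",    # hypothetical
#     DeploymentConfig=config,
# )
```

Swapping `Type` to `"CANARY"` (with a `CanarySize` instead of `LinearStepSize`) or `"ALL_AT_ONCE"` gives the other two modes.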
How your model serves predictions depends on latency requirements, traffic patterns, and payload size. SageMaker offers four inference modes.
Real-Time: always-on endpoint with sub-second latency, for sustained traffic needing immediate responses. AnyCompany: fraud scoring on each transaction, AnyCompany Assist responses.
Serverless: scales to zero when idle, with a cold start on the first request. For intermittent or unpredictable traffic. AnyCompany: ad-hoc salary benchmarking queries from HR analysts.
Asynchronous: queues requests for processing; handles large payloads and long processing times. AnyCompany: batch document OCR processing, large report generation.
Batch Transform: processes entire datasets offline; no endpoint needed. AnyCompany: monthly attrition scoring for all 50K employees, quarterly salary benchmarking refresh.
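A hedged sketch of what the monthly attrition run could look like as a Batch Transform request. Job, model, S3, and instance names are all hypothetical; the dict shape follows SageMaker's `create_transform_job` API:

```python
def transform_job_params(job_name, model_name, input_uri, output_uri):
    """Build the request dict for sagemaker.create_transform_job(**params)."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_uri,
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",  # score one employee record per CSV line
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    }

params = transform_job_params(
    "attrition-scoring-monthly",              # hypothetical job name
    "attrition-model-v3",                     # hypothetical model
    "s3://anycompany-ml/attrition/input/",    # hypothetical S3 locations
    "s3://anycompany-ml/attrition/scores/",
)
# import boto3
# boto3.client("sagemaker").create_transform_job(**params)
```

The job spins up compute, scores every record under the input prefix, writes results to the output prefix, and tears everything down: no endpoint to pay for between runs.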
| Question | If Yes... | If No... |
|---|---|---|
| Need response for each individual request? | Continue below | Batch Transform |
| Large payloads or long processing time? | Asynchronous | Continue below |
| Sustained traffic with consistent latency needs? | Real-Time | Continue below |
| Intermittent traffic or periods of no traffic? | Serverless | Real-Time |
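The decision table above can be sketched as a small function, with one boolean per question, returning the recommended SageMaker inference mode:

```python
def pick_inference_mode(per_request, large_or_slow, sustained, intermittent):
    """Walk the decision table top to bottom and return an inference mode."""
    if not per_request:          # no per-request response needed
        return "Batch Transform"
    if large_or_slow:            # large payloads or long processing
        return "Asynchronous"
    if sustained:                # steady traffic, consistent latency needs
        return "Real-Time"
    return "Serverless" if intermittent else "Real-Time"

# Fraud scoring: per-request, small payload, sustained traffic
assert pick_inference_mode(True, False, True, False) == "Real-Time"
# Monthly attrition scoring: no per-request response needed
assert pick_inference_mode(False, False, False, False) == "Batch Transform"
```

Ad-hoc analyst queries (per-request, intermittent) land on Serverless; OCR batches with large payloads land on Asynchronous.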
For complex serving scenarios, SageMaker offers multi-model and multi-container endpoints that host multiple models or processing stages on a single endpoint.
| Architecture | How It Works | Best For | AnyCompany Use Case |
|---|---|---|---|
| Dedicated Endpoint | One model per endpoint | Simple, isolated, predictable performance | Production fraud model (dedicated resources, SLA-bound) |
| Multi-Model Endpoint | Multiple models share one endpoint and container | Many similar models, cost optimization | Per-client attrition models (1000+ clients, same algorithm, different weights) |
| Multi-Container (Pipeline) | Sequential containers: preprocess → model → postprocess | Complex inference workflows | Document OCR: image preprocessing → field extraction → validation |
| Multi-Container (Direct) | Multiple frameworks in one endpoint, invoke individually | Different model types served together | Salary model (XGBoost) + explanation model (SHAP) on same endpoint |
AnyCompany serves thousands of clients. Each client may have a customized attrition model trained on their specific workforce data. Instead of 1000 dedicated endpoints ($$$), use a multi-model endpoint that dynamically loads the right model per request. Cost savings: 90%+ vs dedicated endpoints.
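On a multi-model endpoint, the caller names the artifact to use via the `TargetModel` parameter of `invoke_endpoint`, and SageMaker loads it on demand. A minimal sketch; the endpoint name, artifact naming scheme, and payload are hypothetical:

```python
def invoke_params(client_id, csv_payload):
    """Build the invoke_endpoint request for one client's attrition model."""
    return {
        "EndpointName": "attrition-mme",              # hypothetical multi-model endpoint
        "TargetModel": f"client-{client_id}.tar.gz",  # artifact under the MME's S3 prefix
        "ContentType": "text/csv",
        "Body": csv_payload,
    }

params = invoke_params("00123", "42,engineering,3")
# import boto3
# response = boto3.client("sagemaker-runtime").invoke_endpoint(**params)
```

Recently used models stay cached on the instance; cold models are fetched from S3 on first invocation, which is the latency trade-off behind the cost savings.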
Inference compute is an ongoing cost (unlike training, which runs only when you retrain). Choosing the right instance type and optimization strategy directly impacts your monthly AWS bill.
| Type | Best For | Cost | AnyCompany Model |
|---|---|---|---|
| CPU (ml.m5, ml.c5) | Simple models, low-latency tabular inference | $ | XGBoost fraud scoring, linear salary prediction |
| GPU (ml.g5, ml.p3) | Complex models, neural networks, batch image processing | $$$ | Document OCR CNN, NLP intent classification |
| AWS Inferentia (ml.inf1/inf2) | ML inference optimized, lower cost than GPU | $$ | AnyCompany Assist LLM serving (high throughput, lower cost) |
Savings Plans: commit to consistent usage for 1-3 years and save 20-40% on always-on endpoints. Best for production fraud detection (runs 24/7).
Auto-scaling: scale instances with traffic - up during payroll processing peaks, down overnight. Pay only for what you use.
Serverless: scale to zero when idle, so there is no cost during off-hours. Best for analyst-facing tools used during business hours only.
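A back-of-envelope sketch of how these strategies move the monthly bill. The hourly rate here is an assumed placeholder, not a quoted AWS price:

```python
def monthly_endpoint_cost(hourly_rate, instances, hours=730, discount=0.0):
    """Estimate monthly cost for an always-on endpoint (730 hours/month)."""
    return hourly_rate * instances * hours * (1 - discount)

# Assumed $0.25/hr instance rate, 2 instances running 24/7
on_demand = monthly_endpoint_cost(0.25, 2)                   # 365.0
with_savings_plan = monthly_endpoint_cost(0.25, 2, discount=0.30)  # ~30% off
```

Serverless changes the formula entirely: you pay per invocation-duration instead of per instance-hour, which is why it wins for tools idle most of the day.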
Inference Recommender: SageMaker automatically load-tests instance types and recommends the optimal configuration for your model, removing the guesswork.
Select an AnyCompany ML model to see the recommended deployment configuration - target, strategy, inference mode, and compute.
Payroll Fraud Detection: real-time scoring on every transaction. Zero tolerance for downtime. Critical path.
Monthly Attrition Scoring: score all 50K employees once per month. Results feed into HR dashboards.
AnyCompany Assist (LLM): user-facing conversational AI. Variable traffic, needs fast responses during business hours.
Document OCR: process batches of scanned tax forms. Large payloads, async processing acceptable.
| Configuration | Recommendation (example: Payroll Fraud Detection) |
|---|---|
| Deployment Target | SageMaker Real-Time Endpoint (dedicated) |
| Inference Mode | Real-time (sub-100ms latency requirement) |
| Instance Type | ml.c5.xlarge (CPU - XGBoost is CPU-optimized) |
| Scaling | Auto-scaling: min 2, max 10 instances. Scale on InvocationsPerInstance. |
| Deployment Strategy | Linear (25% steps, 1-hour bake, CloudWatch alarm rollback) |
| Estimated Cost | ~$400/month (2 instances baseline + scaling during peaks) |
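The scaling row above (min 2, max 10, scale on invocations per instance) is configured through Application Auto Scaling. A sketch of building the two request payloads; endpoint and variant names are hypothetical, and the target value is an assumed number to tune from load tests:

```python
def scaling_requests(endpoint, variant, min_cap=2, max_cap=10, target=1000.0):
    """Build register_scalable_target and put_scaling_policy payloads."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_cap,
        "MaxCapacity": max_cap,
    }
    policy = {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,  # invocations per instance per minute (assumed)
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return register, policy

register, policy = scaling_requests("payroll-fraud-endpoint", "AllTraffic")
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**register)
# aas.put_scaling_policy(**policy)
```

Target tracking adds instances when the per-instance invocation rate exceeds the target and removes them when it falls, which handles the 10x tax-season spikes without manual intervention.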
Deployment targets: SageMaker (default), EKS (K8s teams), ECS (containers), Lambda (serverless events).
Traffic shifting: All-at-once (fast), Canary (balanced), Linear (safest). Auto-rollback on alarm.
Inference modes: Real-Time, Serverless, Async, Batch. Match to traffic pattern and latency needs.
Cost optimization: right-size instances, auto-scale, Savings Plans for production, Inferentia for LLMs.