Module 8 - Interactive Explainer
From endpoint configuration to traffic shifting to inference optimization - deploy ML models safely at enterprise scale serving millions of workforce transactions.
Once your model is trained and evaluated, it needs a production home. AWS offers four deployment targets - each optimized for different operational requirements. At AnyCompany, the choice depends on latency needs, team expertise, and scale.
Amazon SageMaker: fully managed, minimal ops overhead. Built-in monitoring, auto-scaling, A/B testing. The best default choice for most AnyCompany ML models.
Amazon EKS: managed Kubernetes. Advanced custom configurations, multi-framework serving. For teams already running microservices on K8s (AnyCompany platform teams).
Amazon ECS: managed container orchestration without Kubernetes complexity. A good middle ground when you need container control but not K8s features.
AWS Lambda: serverless, pay-per-invocation. Lightweight models with intermittent traffic. AnyCompany: simple scoring functions triggered by events.
Production data drifts from training data. Payroll patterns change seasonally, new countries onboard, regulations shift. Models degrade silently.
AnyCompany processes payroll for millions of workers. Year-end and tax season create 10x traffic spikes. Endpoints must auto-scale without latency degradation.
Models need regular retraining as data evolves. Deploying new versions without downtime or regression requires sophisticated deployment strategies.
ML deployment is not just model serving - it includes monitoring, logging, rollback, security, and integration with existing AnyCompany microservices.
Deploying a new model version to production is risky. Traffic shifting strategies let you gradually move users to the new version while monitoring for issues - with automatic rollback if something goes wrong.
Run two identical environments. Shift traffic from the old (blue) to the new (green) version using one of three modes:
| Mode | How It Works | Risk Level | AnyCompany Use Case |
|---|---|---|---|
| All-at-Once | 100% of traffic flips instantly to the new version. Bake, then clean up the old fleet. | Higher | Low-risk model updates (retrained on same features, minor accuracy bump) |
| Canary | Send 10-25% to new version first. If alarms OK, flip remaining traffic. | Medium | New attrition model version - test on subset of predictions before full rollout |
| Linear | Gradually shift in steps (25% → 50% → 75% → 100%) with baking periods between. | Lowest | Payroll fraud model update - critical system, minimize blast radius at every step |
Rolling deployment: replace instances one at a time, terminating old instances as new ones come online. Simpler than blue/green, but there is no instant rollback.
Payroll Fraud Model: Linear (25% steps, 1-hour bake). Critical system - any regression could miss real fraud. Maximum safety.
Attrition Prediction: Canary (10% canary, 30-min bake). Important but not real-time critical. Moderate caution.
Learning Recommendations: All-at-once. Low-stakes model. If recommendations are slightly worse for an hour, no business impact.
AnyCompany Assist (LLM): Canary (5% canary, 2-hour bake). User-facing, but can tolerate brief quality dip on small traffic slice.
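The strategies above map to SageMaker's `DeploymentConfig` on `update_endpoint`. A minimal sketch of building the linear-shift configuration used for the fraud model (25% steps, 1-hour bake, alarm-based rollback); the endpoint, config, and alarm names are hypothetical placeholders:

```python
def linear_deployment_config(step_percent=25, bake_seconds=3600, alarm_names=()):
    """Build the DeploymentConfig dict accepted by SageMaker's update_endpoint."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                # shift this share of capacity per step
                "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": step_percent},
                "WaitIntervalInSeconds": bake_seconds,  # bake time between steps
            },
            "TerminationWaitInSeconds": 600,  # keep the blue fleet briefly for rollback
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": name} for name in alarm_names]
        },
    }

config = linear_deployment_config(alarm_names=["fraud-model-error-rate"])
# import boto3
# boto3.client("sagemaker").update_endpoint(
#     EndpointName="payroll-fraud-endpoint",   # hypothetical
#     EndpointConfigName="fraud-config-v2",    # hypothetical
#     DeploymentConfig=config,
# )
```

Swapping `Type` to `"CANARY"` (with a `CanarySize` instead of `LinearStepSize`) or `"ALL_AT_ONCE"` gives the other two modes.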
How your model serves predictions depends on latency requirements, traffic patterns, and payload size. SageMaker offers four inference modes.
Real-Time: always-on endpoint with sub-second latency, for sustained traffic needing immediate responses. AnyCompany: fraud scoring on each transaction, AnyCompany Assist responses.
Serverless: scales to zero when idle, with a cold start on the first request. For intermittent or unpredictable traffic. AnyCompany: ad-hoc salary benchmarking queries from HR analysts.
Asynchronous: queues requests for processing; handles large payloads and long processing times. AnyCompany: batch document OCR processing, large report generation.
Batch Transform: processes entire datasets offline; no endpoint needed. AnyCompany: monthly attrition scoring for all 50K employees, quarterly salary benchmarking refresh.
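A hedged sketch of what the monthly attrition run could look like as a Batch Transform request. Job, model, S3, and instance names are all hypothetical; the dict shape follows SageMaker's `create_transform_job` API:

```python
def transform_job_params(job_name, model_name, input_uri, output_uri):
    """Build the request dict for sagemaker.create_transform_job(**params)."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_uri,
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",  # score one employee record per CSV line
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    }

params = transform_job_params(
    "attrition-scoring-monthly",              # hypothetical job name
    "attrition-model-v3",                     # hypothetical model
    "s3://anycompany-ml/attrition/input/",    # hypothetical S3 locations
    "s3://anycompany-ml/attrition/scores/",
)
# import boto3
# boto3.client("sagemaker").create_transform_job(**params)
```

The job spins up compute, scores every record under the input prefix, writes results to the output prefix, and tears everything down: no endpoint to pay for between runs.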
| Question | If Yes... | If No... |
|---|---|---|
| Need response for each individual request? | Continue below | Batch Transform |
| Large payloads or long processing time? | Asynchronous | Continue below |
| Sustained traffic with consistent latency needs? | Real-Time | Continue below |
| Intermittent traffic or periods of no traffic? | Serverless | Real-Time |
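The decision table above can be sketched as a small function, with one boolean per question, returning the recommended SageMaker inference mode:

```python
def pick_inference_mode(per_request, large_or_slow, sustained, intermittent):
    """Walk the decision table top to bottom and return an inference mode."""
    if not per_request:          # no per-request response needed
        return "Batch Transform"
    if large_or_slow:            # large payloads or long processing
        return "Asynchronous"
    if sustained:                # steady traffic, consistent latency needs
        return "Real-Time"
    return "Serverless" if intermittent else "Real-Time"

# Fraud scoring: per-request, small payload, sustained traffic
assert pick_inference_mode(True, False, True, False) == "Real-Time"
# Monthly attrition scoring: no per-request response needed
assert pick_inference_mode(False, False, False, False) == "Batch Transform"
```

Ad-hoc analyst queries (per-request, intermittent) land on Serverless; OCR batches with large payloads land on Asynchronous.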
For complex serving scenarios, SageMaker offers multi-model and multi-container endpoints that host multiple models or processing stages on a single endpoint.
| Architecture | How It Works | Best For | AnyCompany Use Case |
|---|---|---|---|
| Dedicated Endpoint | One model per endpoint | Simple, isolated, predictable performance | Production fraud model (dedicated resources, SLA-bound) |
| Multi-Model Endpoint | Multiple models share one endpoint and container | Many similar models, cost optimization | Per-client attrition models (1000+ clients, same algorithm, different weights) |
| Multi-Container (Pipeline) | Sequential containers: preprocess → model → postprocess | Complex inference workflows | Document OCR: image preprocessing → field extraction → validation |
| Multi-Container (Direct) | Multiple frameworks in one endpoint, invoke individually | Different model types served together | Salary model (XGBoost) + explanation model (SHAP) on same endpoint |
AnyCompany serves thousands of clients. Each client may have a customized attrition model trained on their specific workforce data. Instead of 1000 dedicated endpoints ($$$), use a multi-model endpoint that dynamically loads the right model per request. Cost savings: 90%+ vs dedicated endpoints.
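On a multi-model endpoint, the caller names the artifact to use via the `TargetModel` parameter of `invoke_endpoint`, and SageMaker loads it on demand. A minimal sketch; the endpoint name, artifact naming scheme, and payload are hypothetical:

```python
def invoke_params(client_id, csv_payload):
    """Build the invoke_endpoint request for one client's attrition model."""
    return {
        "EndpointName": "attrition-mme",              # hypothetical multi-model endpoint
        "TargetModel": f"client-{client_id}.tar.gz",  # artifact under the MME's S3 prefix
        "ContentType": "text/csv",
        "Body": csv_payload,
    }

params = invoke_params("00123", "42,engineering,3")
# import boto3
# response = boto3.client("sagemaker-runtime").invoke_endpoint(**params)
```

Recently used models stay cached on the instance; cold models are fetched from S3 on first invocation, which is the latency trade-off behind the cost savings.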
Inference compute is an ongoing cost (unlike training, which runs only when you retrain). Choosing the right instance type and optimization strategy directly impacts your monthly AWS bill.
| Type | Best For | Cost | AnyCompany Model |
|---|---|---|---|
| CPU (ml.m5, ml.c5) | Simple models, low-latency tabular inference | $ | XGBoost fraud scoring, linear salary prediction |
| GPU (ml.g5, ml.p3) | Complex models, neural networks, batch image processing | $$$ | Document OCR CNN, NLP intent classification |
| AWS Inferentia (ml.inf1/inf2) | ML inference optimized, lower cost than GPU | $$ | AnyCompany Assist LLM serving (high throughput, lower cost) |
Savings Plans: commit to consistent usage for 1-3 years and save 20-40% on always-on endpoints. Best for production fraud detection (runs 24/7).
Auto-scaling: scale instances with traffic - up during payroll processing peaks, down overnight. Pay only for what you use.
Serverless: scale to zero when idle, so there is no cost during off-hours. Best for analyst-facing tools used during business hours only.
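A back-of-envelope sketch of how these strategies move the monthly bill. The hourly rate here is an assumed placeholder, not a quoted AWS price:

```python
def monthly_endpoint_cost(hourly_rate, instances, hours=730, discount=0.0):
    """Estimate monthly cost for an always-on endpoint (730 hours/month)."""
    return hourly_rate * instances * hours * (1 - discount)

# Assumed $0.25/hr instance rate, 2 instances running 24/7
on_demand = monthly_endpoint_cost(0.25, 2)                   # 365.0
with_savings_plan = monthly_endpoint_cost(0.25, 2, discount=0.30)  # ~30% off
```

Serverless changes the formula entirely: you pay per invocation-duration instead of per instance-hour, which is why it wins for tools idle most of the day.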
Inference Recommender: SageMaker automatically load-tests instance types and recommends the optimal configuration for your model, removing the guesswork.
Select an AnyCompany ML model to see the recommended deployment configuration - target, strategy, inference mode, and compute.
Payroll Fraud Detection: real-time scoring on every transaction. Zero tolerance for downtime. Critical path.
Monthly Attrition Scoring: score all 50K employees once per month. Results feed into HR dashboards.
AnyCompany Assist (LLM): user-facing conversational AI. Variable traffic, needs fast responses during business hours.
Document OCR: process batches of scanned tax forms. Large payloads, async processing acceptable.
| Configuration | Recommendation (example: Payroll Fraud Detection) |
|---|---|
| Deployment Target | SageMaker Real-Time Endpoint (dedicated) |
| Inference Mode | Real-time (sub-100ms latency requirement) |
| Instance Type | ml.c5.xlarge (CPU - XGBoost is CPU-optimized) |
| Scaling | Auto-scaling: min 2, max 10 instances. Scale on InvocationsPerInstance. |
| Deployment Strategy | Linear (25% steps, 1-hour bake, CloudWatch alarm rollback) |
| Estimated Cost | ~$400/month (2 instances baseline + scaling during peaks) |
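The scaling row above (min 2, max 10, scale on invocations per instance) is configured through Application Auto Scaling. A sketch of building the two request payloads; endpoint and variant names are hypothetical, and the target value is an assumed number to tune from load tests:

```python
def scaling_requests(endpoint, variant, min_cap=2, max_cap=10, target=1000.0):
    """Build register_scalable_target and put_scaling_policy payloads."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_cap,
        "MaxCapacity": max_cap,
    }
    policy = {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,  # invocations per instance per minute (assumed)
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return register, policy

register, policy = scaling_requests("payroll-fraud-endpoint", "AllTraffic")
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**register)
# aas.put_scaling_policy(**policy)
```

Target tracking adds instances when the per-instance invocation rate exceeds the target and removes them when it falls, which handles the 10x tax-season spikes without manual intervention.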
Deployment targets: SageMaker (default), EKS (K8s teams), ECS (containers), Lambda (serverless events).
Traffic shifting: All-at-once (fast), Canary (balanced), Linear (safest). Auto-rollback on alarm.
Inference modes: Real-Time, Serverless, Async, Batch. Match to traffic pattern and latency needs.
Cost optimization: right-size instances, auto-scale, Savings Plans for production, Inferentia for LLMs.