MLOps: From Model to Production
Baljeet Dogra
Getting a model to work in a Jupyter notebook is just the beginning. MLOps—Machine Learning Operations—is the discipline of deploying, monitoring, and maintaining ML models in production. This guide covers the essential practices for taking models from development to production reliably.
What is MLOps?
MLOps is the practice of applying DevOps principles to machine learning. It combines ML development (data science, model training) with operations (deployment, monitoring, maintenance) to create reliable, scalable ML systems.
The goals of MLOps are to:
- Deploy models reliably: automated, repeatable, and safe deployments
- Monitor model performance: detect drift, degradation, and issues in real time
- Version everything: models, data, code, and configurations
- Enable rapid iteration: test, deploy, and roll back quickly
- Maintain model quality: ensure models continue to perform well over time
The MLOps Lifecycle
MLOps follows a continuous cycle:
1. Develop: data preparation, feature engineering, model training, experimentation
2. Test: unit tests, integration tests, model evaluation, validation
3. Deploy: model packaging, containerisation, deployment to staging/production
4. Monitor: performance tracking, drift detection, error monitoring, cost tracking
Monitoring then triggers retraining, which feeds back into development, closing the loop.
1. Model Versioning
Version control isn't just for code—it's essential for models, data, and configurations. Versioning enables reproducibility, rollbacks, and audit trails.
What to Version
Models
Store model artifacts (weights, architecture, metadata) with unique versions
Tools: MLflow, Weights & Biases, DVC, S3 with versioning
Data
Version training datasets, feature stores, and data pipelines
Tools: DVC, Git LFS, data versioning systems
Code
Training scripts, preprocessing code, inference code, configurations
Tools: Git, with proper branching strategies
Configurations
Hyperparameters, feature flags, environment configs
Store alongside code or in dedicated config management systems
Best Practices for Model Versioning
- Use semantic versioning: Major.Minor.Patch (e.g., v1.2.3) for model releases
- Link versions: Tie each model version to its code commit, data version, and config version
- Store metadata: Training metrics, dataset info, hyperparameters, environment details
- Tag production models: Clearly mark which versions are in production
- Enable rollback: Keep previous versions accessible for quick rollback
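The practices above can be sketched with plain Python. This is a minimal, tool-agnostic illustration (registries like MLflow handle this for you); the `ModelVersion` fields and helper names are hypothetical, not part of any specific library:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ModelVersion:
    """Metadata linking a model release to its code, data, and config."""
    version: str        # semantic version, e.g. "1.2.3"
    code_commit: str    # git SHA of the training code
    data_hash: str      # content hash of the training dataset
    metrics: dict       # evaluation metrics recorded at training time

def hash_dataset(payload: bytes) -> str:
    """Content-address the training data so any change is detectable."""
    return hashlib.sha256(payload).hexdigest()[:12]

def bump_minor(version: str) -> str:
    """Increment the minor version for a retrained model (resets patch)."""
    major, minor, _ = (int(p) for p in version.split("."))
    return f"{major}.{minor + 1}.0"

def to_registry_record(mv: ModelVersion) -> str:
    """Serialise metadata for storage next to the model artifact."""
    return json.dumps(asdict(mv), sort_keys=True)
```

Storing this record alongside the artifact is what makes rollback and audit trails possible: given any production version, you can recover the exact code, data, and hyperparameters that produced it.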
2. Model Deployment Strategies
How you deploy models depends on your requirements. Here are the main strategies:
2.1 Batch Inference
Process predictions in batches, typically on a schedule (hourly, daily, weekly).
- Use cases: Recommendations, scoring, reporting, ETL pipelines
- Pros: Efficient resource usage, easier to scale, cost-effective
- Cons: Not real-time, requires batch infrastructure
Example: A daily batch job that scores all customers for churn risk, then updates the database.
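A batch job like that reduces to a simple loop over chunks. This sketch assumes a stand-in `score` callable where a real job would load a trained model; chunking bounds memory when the customer table is large:

```python
from typing import Callable, Iterator

def chunked(items: list, size: int) -> Iterator[list]:
    """Yield fixed-size chunks so the job bounds memory per batch."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_batch_scoring(
    customers: list[dict],
    score: Callable[[dict], float],
    chunk_size: int = 1000,
) -> list[dict]:
    """Score every customer and collect rows for a bulk database update."""
    updates = []
    for batch in chunked(customers, chunk_size):
        for customer in batch:
            updates.append({"id": customer["id"], "churn_risk": score(customer)})
    return updates
```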
2.2 Real-Time Inference (Online)
Serve predictions on-demand via API endpoints, typically with low latency requirements.
- Use cases: Chatbots, fraud detection, recommendations, personalisation
- Pros: Immediate results, interactive applications
- Cons: Higher infrastructure costs, must meet strict latency requirements
Example: REST API that returns product recommendations within 100ms of a user request.
2.3 Streaming Inference
Process predictions on streaming data in near real-time.
- Use cases: Real-time fraud detection, anomaly detection, live recommendations
- Pros: Real-time insights, handles high-volume streams
- Cons: Complex infrastructure, requires stream processing expertise
Example: Kafka stream processing that scores transactions for fraud as they occur.
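The core of such a pipeline is independent of Kafka itself. This framework-agnostic sketch consumes any iterable of transactions (a Kafka consumer is itself an iterable of messages) and emits alerts above a threshold; the `fraud_score` callable and alert fields are illustrative:

```python
from typing import Callable, Iterable, Iterator

def score_stream(
    transactions: Iterable[dict],
    fraud_score: Callable[[dict], float],
    threshold: float = 0.9,
) -> Iterator[dict]:
    """Consume a transaction stream and emit fraud alerts as events arrive."""
    for txn in transactions:
        s = fraud_score(txn)
        if s >= threshold:
            yield {"txn_id": txn["id"], "score": s, "action": "flag_for_review"}
```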
3. Deployment Patterns
Choose deployment patterns that minimise risk and enable safe rollouts:
3.1 Blue-Green Deployment
Run two identical production environments. Deploy new model to "green", test, then switch traffic. If issues occur, instantly switch back to "blue".
Best for: Zero-downtime deployments, easy rollback, critical production systems.
3.2 Canary Deployment
Gradually roll out new model to a small percentage of traffic (e.g., 5%), monitor performance, then gradually increase if successful. Roll back if issues detected.
Best for: Testing new models safely, gradual rollouts, A/B testing.
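The traffic split is typically deterministic rather than random, so each user stays on the same model across requests and the canary's metrics stay comparable. A minimal sketch:

```python
import hashlib

def route_to_canary(user_id: str, canary_percent: float) -> bool:
    """Deterministically assign a stable slice of users to the canary model.

    Hashing the user ID (rather than sampling per request) keeps each user
    on the same model across requests, which makes metrics comparable.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket in [0, 100)
    return bucket < canary_percent
```

Increasing the rollout is then a config change: raising `canary_percent` from 5 to 25 moves more buckets onto the new model without reshuffling existing users.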
3.3 Shadow Mode
Run new model in parallel with production model, but don't use its predictions. Compare outputs to validate performance before switching.
Best for: Validating new models, comparing performance, risk-free testing.
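Shadow mode amounts to calling both models on every request but only ever returning the production output. This sketch logs disagreements for offline review; the `tolerance` value and log shape are illustrative:

```python
from typing import Callable

def predict_with_shadow(
    features: dict,
    prod_model: Callable[[dict], float],
    shadow_model: Callable[[dict], float],
    log: list,
    tolerance: float = 0.05,
) -> float:
    """Serve the production prediction while recording shadow disagreement.

    Callers only ever see the production output; the shadow model's
    prediction is logged for offline comparison before any cutover.
    """
    prod_pred = prod_model(features)
    shadow_pred = shadow_model(features)
    log.append({
        "prod": prod_pred,
        "shadow": shadow_pred,
        "disagree": abs(prod_pred - shadow_pred) > tolerance,
    })
    return prod_pred
```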
4. Model Serving Infrastructure
Choose the right infrastructure for serving your models:
4.1 Containerisation
Package models in containers (Docker) for consistent deployment across environments:
- Include model, dependencies, and inference code
- Ensures consistency between dev, staging, and production
- Enables easy scaling and deployment
- Works with Kubernetes, ECS, or any container orchestration
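A typical image bundles exactly those three pieces. This is a minimal sketch; the file names (`serve.py`, `model.pkl`, `requirements.txt`) are placeholders for your own inference server, artifact, and pinned dependencies:

```dockerfile
# Minimal sketch: package the model artifact and inference code together.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer caches across rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model artifact and the inference server.
COPY model.pkl serve.py ./

EXPOSE 8080
CMD ["python", "serve.py"]
```

Baking the model into the image keeps deployments atomic (one image = one model version); an alternative is pulling the artifact from a registry at startup, which keeps images small but adds a runtime dependency.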
4.2 Model Serving Frameworks
Use specialised frameworks for efficient model serving:
- TensorFlow Serving: For TensorFlow models, optimised for production
- TorchServe: For PyTorch models
- MLflow Models: Framework-agnostic model serving
- Triton Inference Server: NVIDIA's multi-framework serving platform
- Custom APIs: FastAPI, Flask, or gRPC for custom serving logic
4.3 Serverless Options
For variable or low-volume workloads, consider serverless:
- AWS SageMaker, Google Cloud AI Platform, Azure ML
- AWS Lambda, Google Cloud Functions (for lightweight models)
- Pay per request, auto-scaling, no infrastructure management
5. Model Monitoring
Monitoring is critical. Models degrade over time, and you need to detect issues before they impact users.
5.1 Performance Metrics
Track model performance in production:
Prediction Metrics
- Accuracy, precision, recall, F1
- Prediction confidence scores
- Prediction distribution
Operational Metrics
- Latency (P50, P95, P99)
- Throughput (requests/second)
- Error rates
- Resource usage (CPU, memory, GPU)
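The latency percentiles are usually computed over a sliding window of recent requests. A minimal nearest-rank sketch (monitoring stacks like Prometheus compute these for you):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a window of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_summary(samples: list[float]) -> dict:
    """The P50/P95/P99 triple commonly tracked on serving dashboards."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

Tracking tail percentiles matters because averages hide the slow requests: a healthy mean latency can coexist with a P99 that blows the latency budget for 1% of users.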
5.2 Data Drift Detection
Monitor for data drift—when production data differs from training data:
- Feature drift: Distribution of input features changes
- Concept drift: Relationship between features and target changes
- Detection methods: Statistical tests (KS test, PSI), distribution comparisons, model confidence monitoring
Tools: Evidently AI, Fiddler, Aporia, or custom drift detection pipelines.
5.3 Model Health Monitoring
Monitor overall model health:
- Prediction quality: Compare predictions to ground truth (when available)
- Anomaly detection: Flag unusual prediction patterns
- Business metrics: Track downstream business impact (conversion rates, revenue, etc.)
- Alerting: Set up alerts for performance degradation, drift, or errors
6. CI/CD for ML
Automate your ML pipeline with continuous integration and deployment:
CI/CD Pipeline Stages
1. Code Quality Checks
Linting, type checking, code formatting, security scans
2. Unit & Integration Tests
Test data processing, feature engineering, model training logic
3. Model Training & Validation
Train model, run evaluation tests, check performance thresholds
4. Model Packaging
Package model, dependencies, and metadata
5. Deployment
Deploy to staging, run smoke tests, deploy to production
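Stage 3's "check performance thresholds" is typically a small gate script whose exit code fails the pipeline. A minimal sketch; the threshold values and metric names here are hypothetical:

```python
import sys

# Hypothetical thresholds a team might enforce before promoting a model.
THRESHOLDS = {"accuracy": 0.90, "auc": 0.85}

def evaluation_gate(metrics: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    return [
        f"{name}: {metrics.get(name, 0.0):.3f} < {minimum:.3f}"
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]

if __name__ == "__main__":
    # In CI this would read metrics produced by the evaluation step.
    failures = evaluation_gate({"accuracy": 0.93, "auc": 0.88})
    sys.exit(1 if failures else 0)  # non-zero exit fails the pipeline stage
```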
7. Model Retraining & Updates
Models need regular updates. Plan for retraining:
7.1 Retraining Triggers
- Scheduled: Retrain weekly, monthly, or on a fixed schedule
- Performance-based: Retrain when metrics drop below threshold
- Drift-based: Retrain when data drift is detected
- Data-based: Retrain when new labelled data is available
7.2 Automated Retraining Pipelines
Automate the retraining process:
- Fetch latest data
- Run data validation checks
- Train new model version
- Evaluate against validation set
- Compare to current production model
- Deploy if better, or alert if worse
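The final "deploy if better, or alert if worse" step is a champion/challenger comparison. A sketch of the decision logic, with an illustrative `min_improvement` margin:

```python
def promotion_decision(
    champion_metrics: dict,
    challenger_metrics: dict,
    primary_metric: str = "auc",
    min_improvement: float = 0.005,
) -> str:
    """Decide whether a retrained model should replace production.

    Requiring a minimum improvement (rather than any improvement) avoids
    churn from retrains that only move the metric by noise.
    """
    delta = challenger_metrics[primary_metric] - champion_metrics[primary_metric]
    if delta >= min_improvement:
        return "deploy_challenger"
    if delta <= -min_improvement:
        return "alert_regression"
    return "keep_champion"
```

A challenger that regresses should trigger an alert rather than a silent skip: a retrained model getting worse often points to a data quality or pipeline problem upstream.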
8. Best Practices Summary
Version Everything
Models, data, code, and configurations. Enable reproducibility and rollback.
Automate Everything
Training, testing, deployment, and monitoring. Reduce manual errors and speed up iteration.
Monitor Continuously
Track performance, drift, errors, and costs. Set up alerts for anomalies.
Test Thoroughly
Unit tests, integration tests, model evaluation, and staging environment validation.
Deploy Safely
Use canary or blue-green deployments. Enable quick rollback. Test in staging first.
Document Everything
Model cards, deployment runbooks, monitoring dashboards, and incident response procedures.
Common MLOps Tools
Model Management
- MLflow
- Weights & Biases
- DVC
- Kubeflow
Model Serving
- TensorFlow Serving
- TorchServe
- Triton Inference Server
- Seldon Core
Monitoring
- Evidently AI
- Fiddler
- Aporia
- Custom dashboards
Orchestration
- Airflow
- Prefect
- Kubeflow Pipelines
- MLflow Pipelines
Conclusion
MLOps is essential for deploying and maintaining ML models in production. The key principles are:
- Version control: Track models, data, code, and configs for reproducibility
- Automation: CI/CD pipelines for testing, training, and deployment
- Monitoring: Track performance, detect drift, and alert on issues
- Safe deployment: Use canary or blue-green patterns for risk-free rollouts
- Continuous improvement: Retrain models regularly and iterate based on monitoring
Start simple—version your models, set up basic monitoring, and automate deployment. Then gradually add more sophisticated MLOps practices as your needs grow. The goal is reliable, maintainable ML systems that deliver value consistently.
Need Help Setting Up MLOps?
If you're looking to deploy ML models to production or improve your MLOps practices, I can help with model versioning, deployment pipelines, monitoring setup, and CI/CD automation. Let's discuss your requirements.