End-to-End MLOps for Scalable Model Deployment: Engineering Best Practices for Efficient and Reliable Machine Learning Systems

© 2024 by IJCTT Journal
Volume-72 Issue-11
Year of Publication: 2024
Authors: Koushik Balaji Venkatesan
DOI: 10.14445/22312803/IJCTT-V72I11P118
How to Cite?
Koushik Balaji Venkatesan, "End-to-End MLOps for Scalable Model Deployment: Engineering Best Practices for Efficient and Reliable Machine Learning Systems," International Journal of Computer Trends and Technology, vol. 72, no. 11, pp. 165-171, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I11P118
Abstract
Machine Learning Operations (MLOps) helps integrate machine learning model development with production deployment by applying best practices from software engineering. The machine learning life cycle brings unique challenges, and this paper outlines practical approaches to address them. Key MLOps practices are reviewed, with a focus on Continuous Integration and Continuous Deployment (CI/CD), automated testing, and adaptive scaling strategies. Techniques for deploying models based on latency and traffic demands are explored, including traffic routing and shadow deployments. Advanced strategies such as canary releases, A/B testing, and automated monitoring and retraining are also discussed. By incorporating these engineering best practices, organizations can increase reliability, reduce downtime, build scalable and robust ML pipelines, and accelerate innovation.
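To make the deployment strategies mentioned above concrete, the following is a minimal illustrative sketch, not taken from the paper, of how a serving layer might combine weighted canary routing with shadow mirroring. The Model class, the predict() interface, and the 10% canary weight are hypothetical placeholders chosen for the example; a production setup would typically delegate this to a serving platform or service mesh.

```python
import random

class Model:
    """Stand-in for a deployed model endpoint (illustrative only)."""
    def __init__(self, name, bias=0.0):
        self.name = name
        self.bias = bias

    def predict(self, features):
        # Toy scoring logic so the example runs end to end.
        return sum(features) / len(features) + self.bias


def route_request(features, stable, canary, shadow=None, canary_weight=0.1):
    """Send roughly canary_weight of live traffic to the canary model and,
    if a shadow model is given, mirror the request to it without affecting
    the response returned to the caller."""
    serving = canary if random.random() < canary_weight else stable
    response = serving.predict(features)

    if shadow is not None:
        # The shadow prediction is only logged for offline comparison.
        shadow_response = shadow.predict(features)
        print(f"shadow[{shadow.name}] vs {serving.name}: "
              f"{shadow_response:.3f} vs {response:.3f}")

    return serving.name, response


if __name__ == "__main__":
    stable = Model("v1-stable")
    canary = Model("v2-canary", bias=0.05)
    shadow = Model("v3-shadow", bias=-0.02)

    for _ in range(5):
        features = [random.random() for _ in range(4)]
        print(route_request(features, stable, canary, shadow))
```

In this sketch the canary weight would be raised gradually as monitoring confirms the candidate model's latency and accuracy, while the shadow model receives mirrored traffic purely for evaluation before it is ever promoted.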
Keywords
CI/CD, Load balancing, Machine learning, MLOps, Shadow testing.
References
[1] David Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems, vol. 2, pp. 2503-2511, 2015.
[2] Dev Kumar Chaudhary, Sandeep Srivastava, and Vikas Kumar, “A Review on Hidden Debts in Machine Learning Systems,” 2018 Second International Conference on Green Computing and Internet of Things (ICGCIoT), Bangalore, India, pp. 619-624, 2018.
[3] Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl, “Machine Learning Operations (MLOps): Overview, Definition, and Architecture,” IEEE Access, vol. 11, pp. 31866-31879, 2023.
[4] Satvik Garg et al., “On Continuous Integration / Continuous Delivery for Automated Deployment of Machine Learning Models Using MLOps,” 2021 IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Laguna Hills, CA, USA, pp. 25-28, 2021.
[5] Matteo Testi et al., “MLOps: A Taxonomy and a Methodology,” IEEE Access, vol. 10, pp. 63606-63618, 2022.
[6] Georgios Symeonidis et al., “MLOps - Definitions, Tools and Challenges,” 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, pp. 453-460, 2022.
[7] Meenu Mary John, Helena Holmström Olsson, and Jan Bosch, “Towards MLOps: A Framework and Maturity Model,” 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Palermo, Italy, pp. 1-8, 2021.
[8] Antonio M. Burgueno-Romero et al., “Towards an Open-Source MLOps Architecture,” IEEE Software, vol. 42, no. 1, pp. 59-64, 2025.
[9] Yue Zhou, Yue Yu, and Bo Ding, “Towards MLOps: A Case Study of ML Pipeline Platform,” 2020 International Conference on Artificial Intelligence and Computer Engineering (ICAICE), Beijing, China, pp. 494-500, 2020.
[10] AWS, Serverless Computing - AWS Lambda, Run Code without Thinking about Servers or Clusters, Amazon Web Services, 2024. [Online]. Available: https://aws.amazon.com/pm/lambda/
[11] AWS, Amazon SageMaker Pipelines, Purpose-Built Service for Machine Learning Workflows, Amazon Web Services, 2024. [Online]. Available: https://aws.amazon.com/sagemaker/pipelines/
[12] DVC By Iterative, Data Version Control - and Much More - For the GenAI Era Free and Open Source, Forever. [Online]. Available: https://dvc.org/
[13] MLflow, ML and GenAI Made Simple, Build Better Models and Generative AI Apps on a Unified, End-to-End, Open Source MLOps Platform, 2024. [Online]. Available: https://mlflow.org/
[14] Docker, Develop Faster, Run Anywhere, Build With the #1 Most-Used Developer Tool, 2024. [Online]. Available: https://www.docker.com/
[15] Kubernetes, Kubernetes, Also Known As K8s, is an Open Source System For Automating Deployment, Scaling, and Management of Containerized Applications, 2024. [Online]. Available: https://kubernetes.io/
[16] AWS, Amazon CloudWatch Documentation, Amazon CloudWatch Provides a Reliable, Scalable, and Flexible Monitoring Solution That You Can Start Using Within Minutes, 2024. [Online]. Available: https://docs.aws.amazon.com/cloudwatch/
[17] Prometheus, Prometheus is an Open-Source Systems Monitoring and Alerting Toolkit Originally Built at SoundCloud, 2024. [Online]. Available: https://prometheus.io/docs/introduction/overview/