Research Article | Open Access
Volume 72 | Issue 10 | Year 2024 | Article Id. IJCTT-V72I10P101 | DOI : https://doi.org/10.14445/22312803/IJCTT-V72I10P101
Asynchronous Inference Graph Execution for Model Routing in Machine Learning Systems
Gangadharan Venkataraman
| Received | Revised | Accepted | Published | 
|---|---|---|---|
| 16 Aug 2024 | 20 Sep 2024 | 05 Oct 2024 | 22 Oct 2024 | 
Citation :
Gangadharan Venkataraman, "Asynchronous Inference Graph Execution for Model Routing in Machine Learning Systems," International Journal of Computer Trends and Technology (IJCTT), vol. 72, no. 10, pp. 1-4, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I10P101
Abstract
This paper presents a routing mechanism for machine learning systems based on the asynchronous execution of inference graphs. The mechanism supports model chaining, champion/challenger evaluation, and traffic splitting, which together enable efficient model deployment strategies. We describe the architecture and implementation of the routing mechanism, along with its application to real-world ML pipelines.
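The three routing patterns named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Model` type, the function names `chain`, `champion_challenger`, and `split_traffic`, and the toy models `double` and `add_one` are all hypothetical, and a model is assumed to be any async callable that maps one input to one output.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable

# Assumption: a "model" is any async callable taking one input and returning one output.
Model = Callable[[Any], Awaitable[Any]]


async def chain(models: list[Model], x: Any) -> Any:
    """Model chaining: feed each model's output into the next model."""
    for model in models:
        x = await model(x)
    return x


async def champion_challenger(
    champion: Model, challenger: Model, x: Any, shadow_log: list
) -> Any:
    """Run both models concurrently; serve the champion, record the challenger."""
    champ_out, chall_out = await asyncio.gather(champion(x), challenger(x))
    shadow_log.append((x, champ_out, chall_out))
    return champ_out


def split_traffic(routes: list[tuple[float, Model]], rng: random.Random) -> Model:
    """Traffic splitting: pick one model with probability proportional to its weight."""
    weights = [weight for weight, _ in routes]
    models = [model for _, model in routes]
    return rng.choices(models, weights=weights, k=1)[0]


# Hypothetical toy models used only to exercise the routing functions.
async def double(x: int) -> int:
    await asyncio.sleep(0)  # stands in for asynchronous inference work
    return x * 2


async def add_one(x: int) -> int:
    await asyncio.sleep(0)
    return x + 1
```

As a usage illustration under the same assumptions, `asyncio.run(chain([double, add_one], 3))` passes 3 through `double` and then `add_one`, while `split_traffic([(0.9, double), (0.1, add_one)], random.Random())` would route roughly nine requests in ten to `double`.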
Keywords
Inference Service, Model Routing, Asynchronous Execution, Model Chaining, Champion/Challenger, Traffic Splitting.