Originally a software engineering concept, technical debt is also relevant to machine learning systems; in fact, the landmark Google paper argues that ML systems are especially prone to accumulating it.

> Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems
> [Sculley et al (2015) Hidden Technical Debt in Machine Learning Systems](https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems)

> As the machine learning (ML) community continues to accumulate years of experience with live systems, a wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive
> [Sculley et al (2015) Hidden Technical Debt in Machine Learning Systems](https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems)

Technical debt is especially important to consider when trying to move fast. Moving fast is easy; moving fast without acquiring technical debt is a lot more complicated.

This shift to DevOps, with teams owning the entire development lifecycle, introduced a new kind of overhead for engineers: cognitive load.
> The total amount of mental effort a team uses to understand, operate and maintain their designated systems or tasks.
> [Skelton & Pais (2019) Team Topologies](https://teamtopologies.com/book)

As teams adopting DevOps grapple with the mental effort of understanding, operating, and maintaining their systems, cognitive load becomes a barrier to efficiency. The weight of this additional load can hinder productivity, prompting organizations to seek solutions.

Platforms emerged as a strategic solution, abstracting away unnecessary details of the development lifecycle. This abstraction allows engineers to focus on critical tasks, mitigating cognitive load and fostering a more streamlined workflow.
> The purpose of a platform team is to enable stream-aligned teams to deliver work with substantial autonomy. The stream-aligned team maintains full ownership of building, running, and fixing their application in production. The platform team provides internal services to reduce the cognitive load that would be required from stream-aligned teams to develop these underlying services.
> [Skelton & Pais (2019) Team Topologies](https://teamtopologies.com/book)

> Infrastructure Platform teams enable organisations to scale delivery by solving common product and non-functional requirements with resilient solutions. This allows other teams to focus on building their own things and releasing value for their users
> [Rowse & Shepherd (2022) Building Infrastructure Platforms](https://martinfowler.com/articles/building-infrastructure-platform.html)
### MLOps -- Reducing the technical debt of machine learning

MLOps is a methodology that provides a collection of concepts and workflows designed to reduce the technical debt of machine learning systems.

The Rise of the Machine Learning Platform
------------------------------------------

The paradigm shifts of DevOps, MLOps and Platform Thinking led to the emergence of Machine Learning platforms. ML platforms apply MLOps concepts and workflows to provide a curated developer experience for machine learning developers throughout the entire ML lifecycle. As the ML team grows, the benefits of a platform amplify, creating a multiplier effect that allows organizations to scale whilst maintaining quality and without getting bogged down in technical debt.
### Scribd's ML Platform -- MLOps in Action
At Scribd we have applied DevOps concepts to our ML operations in the following ways (illustrative sketches for each follow the list):
1. **Automation:**
   * Applying CI/CD strategies to model deployments through the use of Jenkins pipelines, which deploy models from the Model Registry to AWS-based endpoints.
   * Automating model training through the use of Airflow DAGs, and allowing these DAGs to trigger the deployment pipelines to deploy a model once re-training has occurred.
2. **Continuous Testing:**
   * Applying continuous testing as part of the model deployment pipeline, removing the need for manual testing.
   * Increased tooling to support model validation testing, providing consistency between training iterations.
3. **Monitoring:**
   * Monitoring real-time inference endpoints, covering both infrastructure performance and model drift.
   * Monitoring training DAGs.
4. **Collaboration and Communication:**
   * A Feature Store which provides feature discovery and re-use.
   * A Model Database which supports model discovery and collaboration.
5. **Version Control:**
   * Applied version control to experiments, machine learning models and features, providing better change management and auditing of these ML artifacts.

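As a concrete illustration of the automation bullet above, here is a minimal sketch of an Airflow DAG that retrains a model and then triggers a downstream deployment pipeline. The task names, the placeholder training logic and the Jenkins job URL are assumptions for illustration, not our actual pipeline definitions.

```python
# Minimal sketch only: task names, the training logic and the Jenkins job URL
# are hypothetical placeholders, not the actual pipeline definitions.
from datetime import datetime

import requests
from airflow.decorators import dag, task


@dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
def retrain_and_deploy():

    @task
    def retrain() -> str:
        """Retrain the model and return the new model version (placeholder logic)."""
        # In a real DAG this would launch a training job and register the
        # resulting model in the Model Registry.
        return "recommendations-model:42"  # hypothetical model version

    @task
    def trigger_deployment(model_version: str) -> None:
        """Kick off the CI/CD pipeline that deploys the freshly trained model."""
        # Hypothetical Jenkins job that deploys a registry model to an AWS endpoint.
        requests.post(
            "https://jenkins.example.com/job/deploy-model/buildWithParameters",
            params={"MODEL_VERSION": model_version},
            timeout=30,
        )

    trigger_deployment(retrain())


retrain_and_deploy()
```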
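In the same spirit, the continuous-testing bullet could look something like this pytest-style gate, run automatically inside the deployment pipeline before a model is promoted. The `my_ml_platform` helpers and the AUC threshold are invented for the example.

```python
# Sketch of an automated model-validation gate; the helpers and threshold
# below are illustrative assumptions, not the real validation suite.
from sklearn.metrics import roc_auc_score

from my_ml_platform import load_candidate_model, load_holdout_data  # hypothetical helpers

MIN_AUC = 0.80  # assumed promotion threshold


def test_candidate_model_meets_quality_bar():
    """Block deployment if the candidate model underperforms on held-out data."""
    model = load_candidate_model()
    features, labels = load_holdout_data()

    scores = model.predict_proba(features)[:, 1]
    auc = roc_auc_score(labels, scores)

    assert auc >= MIN_AUC, f"candidate AUC {auc:.3f} is below the {MIN_AUC} bar"
```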
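For the monitoring bullet, a sketch of what instrumenting a real-time inference endpoint might look like, publishing a latency metric to CloudWatch via boto3. The namespace, dimension values and the `predict` call are made up for the example.

```python
# Sketch: publish a per-request latency metric for an inference endpoint.
# The namespace, dimension values and the predict() call are illustrative only.
import time

import boto3

cloudwatch = boto3.client("cloudwatch")


def predict_with_latency_metric(model, payload):
    start = time.perf_counter()
    prediction = model.predict(payload)
    latency_ms = (time.perf_counter() - start) * 1000.0

    cloudwatch.put_metric_data(
        Namespace="MLPlatform/Inference",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "PredictionLatency",
                "Dimensions": [{"Name": "Endpoint", "Value": "recommendations"}],
                "Unit": "Milliseconds",
                "Value": latency_ms,
            }
        ],
    )
    return prediction
```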
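The collaboration bullet is easiest to picture as a feature store lookup at inference time. The post does not name the tooling, so this sketch uses the open-source Feast API purely as a stand-in; the feature view and entity names are invented.

```python
# Sketch of online feature retrieval; Feast is used here only as a stand-in API,
# and the feature/entity names are invented for illustration.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a local feature repository

features = store.get_online_features(
    features=[
        "user_reading_stats:docs_read_7d",
        "user_reading_stats:avg_session_minutes",
    ],
    entity_rows=[{"user_id": 1234}],
).to_dict()

# `features` now maps each requested feature name to a list of values,
# ready to be assembled into a model input vector.
```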
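Finally, for the version-control bullet, a sketch of tracking an experiment run and registering the resulting model so that deployments can reference a specific version. MLflow is an assumption here rather than a statement about the registry the post refers to, and every name and value is a placeholder.

```python
# Sketch of experiment tracking and model registration; MLflow is assumed
# purely for illustration, and every name and value below is a placeholder.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, random_state=0)  # stand-in training data

mlflow.set_experiment("recommendations")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Version the trained model in the registry so a deployment pipeline
    # can later promote a specific version to an endpoint.
    mlflow.sklearn.log_model(model, "model", registered_model_name="recommendations-model")
```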
### Benefits to the Organization

The adoption of a Machine Learning Platform unfolds a spectrum of benefits:

**Increasing Flow of Change (aka developer velocity):** A swift pace in model development and deployment, enhancing overall efficiency.

**Fostering Collaboration Amongst Teams:** Breaking down silos and promoting cross-functional collaboration. The platform becomes the silent foundation for collaboration, facilitating a harmonious working environment.

**Enforcing Best Practices:** Standardizing and ensuring adherence to best practices across ML projects.

**Reducing/Limiting Technical Debt:** Strategically mitigating the risk of accumulating technical debt, ensuring long-term sustainability.

**Multiplier Effect:** As the ML team grows, these benefits of the platform amplify, a dividend that multiplies with organizational growth.