In this talk, Rsqrd AI welcomes Diego Oppenheimer, CEO and co-founder of Algorithmia! Diego goes in depth on why machine learning projects fail and why we don’t see machine learning in production despite how powerful the technology can be. He shares his experiences on the problems surrounding pushing ML into production.
Algorithmia is a machine learning operations (MLOps) platform that helps data science and machine learning leaders deploy, manage, and scale their ML portfolio. It’s a fully integrated machine learning operational environment that allows someone to deploy from anywhere and securely manage a company's ML portfolio from one central location. The company has seen success in deploying AI at scale and managing the ML lifecycle.
The Problem: Machine Learning != Production Machine Learning
Diego starts off with perhaps an obvious but important point. He remarks that machine learning (ML) is not equivalent to production machine learning.
“… if you’re going to leave today with anything…it’s that machine learning and production machine learning are two completely different beasts”
There are multiple factors to consider when it comes to production ML that one may not think about when it comes to typical machine learning. We think about data collection and modeling and accuracy, but there are other important variables to consider: infrastructure, integration into DevOps tools, and deployment.
When it comes to production ML, integration into software is the “ultimate goal”. Production ML is a tool for building a final product, so it is important to think about how ML integrates and interacts with the software around it.
Teams Can Do More
Algorithmia does a survey every year and gathers data from over 500 practitioners, and they produced some key findings:
- 75% of time is spent on infrastructure tasks, and 25% of time is spent on training models
- 30% of challenges faced are about supporting different languages and frameworks
- 30% of challenges faced are about model management tasks such as versioning and reproducibility
- 38% of challenges are about deploying models at the necessary scale
To view 2020 data, you can interact with their data visualization or read their whitepaper: 2020 state of enterprise machine learning
Teams are spending most of their time on infrastructure tasks, like setting up environments and resolving dependencies, rather than on the models themselves.
There are also some key challenges that participants were facing. New frameworks and languages are being released at a considerable rate. It’s a challenge productionalizing all this technology when new technologies seem to appear every day.
The next challenge is how to manage models. After creating a model, there are challenges in how to version the model and get reproducibility as one continuously develops their technology.
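One common way to approach versioning and reproducibility is to derive a version ID from the model artifact together with the metadata that produced it. The following is a minimal sketch of that idea, not Algorithmia's method; the function name `model_version`, the metadata fields, and the 12-character ID length are all assumptions made for illustration.

```python
import hashlib
import json


def model_version(artifact: bytes, metadata: dict) -> str:
    """Derive a deterministic version ID from model bytes and training metadata."""
    # sort_keys makes the JSON encoding independent of dict insertion order.
    meta_blob = json.dumps(metadata, sort_keys=True).encode("utf-8")
    # A separator byte keeps the artifact and the metadata from running together.
    digest = hashlib.sha256(artifact + b"\x00" + meta_blob).hexdigest()
    return digest[:12]


# Hypothetical stand-ins for a serialized model and its training record.
weights = b"serialized-model-bytes"
meta = {"framework": "sklearn", "data_snapshot": "2020-01-15", "seed": 42}

v1 = model_version(weights, meta)
# Same content in a different key order yields the same ID.
v2 = model_version(
    weights, {"seed": 42, "data_snapshot": "2020-01-15", "framework": "sklearn"}
)
# Changing any training input (here the seed) yields a different ID.
v3 = model_version(weights, {**meta, "seed": 7})
```

Because the ID depends only on content, two people who train from the same inputs and get the same artifact end up with the same version, and any change to the data snapshot or seed shows up as a new one.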
The last challenge is deploying models at an appropriate scale. Models are getting huge and they take up a lot of resources, and it’s become a real engineering challenge to deliver the software experience that someone would want while doing it at scale.
A report by Gartner also showed that the main barrier to delivering business value is the lack of projects that successfully reach production.
The Problem Continued: Where to Start?
A lot of companies are only just starting to explore using ML. They have all this data and see this emerging technology, and now they’re trying to understand what to do with their data and how to use these new tools. Basically, they’re trying to figure out where to start and how to do it. How does someone make decisions with ML? What tools does someone need? Who should someone hire? When bringing in someone from a purely academic background, they are going to face challenges working in an enterprise setting.
Traditional DevOps and ML lifecycle DevOps are different. The latter requires incredibly fast iterations due to dynamic variables such as different hardware or changing data. ML moves faster than traditional app development, and the ML development lifecycle keeps evolving. It’s a complex problem, but at the end of the day, DevOps is about making a faster path to production.
Rapid Growth of ML and Cost
Deploying ML for the first time is not very difficult. Resources are easily acquired, hosted, and managed, and the demand for the product is low. One can work with a few models and frameworks in one or two languages, have a self-managed DevOps team, and expect to support a few end-users.
However, things will grow exponentially. Soon, there needs to be support for hundreds of models, automated systems to help with low discoverability, and thousands of calls to an API per second. Supporting growth becomes very expensive.
A big takeaway is that the ML dev lifecycle moves really fast, so it’s important to treat ML development and application development as separate processes.
Tactical vs. Technical View of the Problem
Diego breaks down the problem of ML deployment into two perspectives: tactical and technical.
Deploying ML today is economically challenging, and it’s due to a lot of factors:
| Problem | What goes wrong | Key question | Recommendations |
| --- | --- | --- | --- |
| Lack of Process | It’s easy to get funding and experiments running, but what needs to happen to get them into production? | How do we get from POC (proof of concept) to production? | Plan and fund deployment upfront; set clear deployment criteria; bring in stakeholders from IT and DevOps early; build for repeatability in process |
| Wrong Incentives | If you just set goals to innovate and experiment instead of to deploy and align with the company’s business, you’re not going to actually get results. | What is the minimum justifiable improvement? | Consider using MJIT (Minimal Justifiable Improvement Tree) by Ian Xiao, a framework that analyzes the cost and value of using ML |
| Wrong Teams | You can’t ask data scientists who lack engineering experience to build infrastructure, and teams that lack DevOps experience and are not partnering with people with the right skills can’t succeed. | Does my team have the right skill set to make my solution deployable in the organization? | Create hybrid teams of engineers, data scientists, and DevOps engineers; don’t chase people who can do everything, since they are hard to find; invest in software and platforms that enhance data science and ML teams |
| Lack of Proper Champions | ML projects without executive sponsorship rarely make it to production, and risk and fear of failure get in the way of having proper supporters (champions). | How to get buy-in from stakeholders? | Align values and interests; involve stakeholders up and down the command chain early; collaborate to achieve goals rather than dictating all the decisions |
| Wrong Technology | Lack of best practices, not building for measurability and repeatability, not thinking about access to data, and not addressing the differences between product and development will hurt prospects of deployment. | What is the best ML architecture for my organization? | Design to execute at scale, repeatedly and efficiently; replace and upgrade components as needs evolve; anticipate and allow a variety of tools and technologies to be used concurrently at every step of the cycle; remain open to integration with the variety of in-house technologies |
The following is a simplification of the ML lifecycle and how Algorithmia views the process: data > train > deploy > manage
Connect to your data management system, publish from your training platform of choice via API, Git, or CI/CD pipeline, deploy and manage models, and integrate with your other models and consuming product applications.
Things to Consider
Training and Production are Different
Training and production conditions are very different. Training has long compute cycles, fixed loads, is stateful, and has a single user. In production, one has short compute bursts, needs elasticity, can be stateless, and has many users.
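To make the "stateless" production side concrete, here is a small sketch of a prediction function whose output depends only on a frozen model and the incoming request. The `LinearModel` class, its weights, and the `predict` function are hypothetical and chosen only to keep the example short; a real serving stack would wrap something like this behind an API.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class LinearModel:
    """Hypothetical model: loaded once at startup, never mutated while serving."""
    weights: Tuple[float, ...]
    bias: float


def predict(model: LinearModel, features: Sequence[float]) -> float:
    """Pure function: the result depends only on the model and this request."""
    if len(features) != len(model.weights):
        raise ValueError("feature count does not match model")
    return sum(w * x for w, x in zip(model.weights, features)) + model.bias


MODEL = LinearModel(weights=(0.5, -1.0), bias=2.0)

# Two identical requests get identical answers, whichever replica serves them.
first = predict(MODEL, [2.0, 1.0])
second = predict(MODEL, [2.0, 1.0])
```

Since no request leaves anything behind, copies of this handler can be started and stopped freely, which is what makes elastic scaling across many users practical.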
Heterogeneous Tooling and Dependencies
There are dozens of combinations of different frameworks, languages, hardware, and dependencies. As mentioned before, it seems like new tech emerges every day. The first time someone builds a stack, the challenge is simply making it all work. As they build, they need to fit everything together in a way that keeps each of these dependencies upgradable.
Composability Compounds the Challenge
A lot of models don’t make it into production because they fall apart along the way, and one wants to be able to patch the broken pieces. How dependencies and pieces fit together is really important. ML models are built out of pipelines, and one would want every piece of that pipeline to be the best it can be. Dependencies can get very complex, so it helps to be able to break them down into pieces.
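The idea of a pipeline built from replaceable pieces can be sketched as plain function composition. This is an illustrative sketch only; the step names (`clip_negative`, `scale_to_unit`, `threshold`) and the `compose` helper are made up for the example and do not come from the talk.

```python
from typing import Callable, Iterable, List

Step = Callable[[List[float]], List[float]]


def compose(steps: Iterable[Step]) -> Step:
    """Chain steps so that each one receives the previous step's output."""
    ordered = list(steps)

    def run(values: List[float]) -> List[float]:
        for step in ordered:
            values = step(values)
        return values

    return run


def clip_negative(values: List[float]) -> List[float]:
    return [max(v, 0.0) for v in values]


def scale_to_unit(values: List[float]) -> List[float]:
    peak = max(values, default=0.0)
    return [v / peak for v in values] if peak else list(values)


def threshold(values: List[float]) -> List[float]:
    return [1.0 if v >= 0.5 else 0.0 for v in values]


pipeline = compose([clip_negative, scale_to_unit, threshold])
# clip -> [0.0, 2.0, 4.0]; scale -> [0.0, 0.5, 1.0]; threshold -> [0.0, 1.0, 1.0]
result = pipeline([-1.0, 2.0, 4.0])
```

Because every step shares one signature, a single piece can be patched or swapped for a better one without touching the rest of the pipeline.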
Diversity Complicates Auditability and Governance
One needs to make sure they’re compliant with their organization’s rules. Some big questions are who’s calling what, who can access it, and how does someone govern dependencies? Diego recalls a story about a financial services customer that had a lot of security measures surrounding dependencies, packages, etc. Data scientists kept pulling things from PyPI with no checks, and the DevOps team became livid when they found out.
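One lightweight way to govern dependencies pulled from a public package index is to check a requirements list against an approved set before anything is installed. The sketch below assumes a hypothetical allowlist (`APPROVED`) and a deliberately simplified parser that accepts only `name==version` lines; it is not a full implementation of pip's requirement syntax.

```python
import re

# Hypothetical allowlist: package name -> versions cleared by the security team.
APPROVED = {"numpy": {"1.18.1"}, "pandas": {"1.0.1"}}

PINNED = re.compile(r"([A-Za-z0-9_.-]+)==([A-Za-z0-9_.]+)")


def audit_requirements(lines, approved):
    """Return (requirement, reason) pairs for every line that fails the policy."""
    violations = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = PINNED.fullmatch(line)
        if match is None:
            violations.append((line, "not pinned to an exact version"))
            continue
        name, version = match.group(1).lower(), match.group(2)
        if name not in approved:
            violations.append((line, "package not approved"))
        elif version not in approved[name]:
            violations.append((line, "version not approved"))
    return violations


requirements = [
    "numpy==1.18.1",
    "pandas>=1.0",
    "leftpad==0.1  # unreviewed",
    "pandas==1.0.1",
]
violations = audit_requirements(requirements, APPROVED)
```

Running a check like this in CI gives data scientists fast feedback and gives the DevOps team a record of what was requested, rather than a surprise after the fact.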
Lack of Reusability Slows Growth
Everyone should build for reusability. Constantly reinventing the wheel and rewriting the same function becomes inefficient and wastes time on updating dependencies. It’s a better practice to build reusable services that can be used throughout the organization.
Measuring Model Performance
Success and performance are very context-dependent. In academia, the goal is to find truth and to improve accuracy. If it takes 10 years to increase accuracy to 93% from 90%, that’s a valid investment. In enterprise, it’s not the same. For example, if someone buys something on Amazon and gets recommended the same item, it’s not a big deal. Success in these two situations depends on the context, goals, and priorities. No single solution is right for every job.
How to Navigate Common Pitfalls and Key Takeaways
Diego quotes a friend with the following remarks about navigating through all the aforementioned considerations:
- Don’t reinvent the wheel
- Outcomes, not process
- Don’t try to be perfect
- Say no to lock-in
- Tools aren’t solutions
- Audit honestly, revise constantly
If there’s anything to take away from this talk, it’s
“Machine Learning != Production Machine Learning!!”
Cool Stuff to Check Out
Video: Rsqrd AI - Diego Oppenheimer - From R&D to ROI of AI
Data Science is boring by Ian Xiao
The Roadmap to Machine Learning Maturity
Hidden Technical Debt in Machine Learning Systems
- What are your challenges in productionalizing deep learning? 32m 07s
- When selling to enterprises and you have an executive sponsor, a lot of the core issues are setting the right expectations about what outcomes you want to achieve. What are your experiences working across different industries on how to deal with those questions of setting the right expectations with your sponsors, to help them feel very confident about some reasonable way of recouping their ROI on investments? 38m 40s
All ideas and information presented in this post are those of the original speaker unless otherwise noted.