
Why deep learning development in production is (still) broken

Software engineering best practices have not been brought into the machine learning space, with the side-effect that there is a great deal of technical debt in these code bases.


Last updated: October 17, 2022.

Around 87% of machine learning projects never make it to production. There is a disconnect between machine learning being done in Jupyter notebooks on local machines and models actually being served to end-users to provide real value.

![MLOps]({{ site.url }}/assets/posts/dl_is_broken_01.png) Source: Hidden Technical Debt in Machine Learning Systems (Sculley et al.)

The oft-quoted Hidden Technical Debt paper, whose diagram can be seen above, has been in circulation since 2015, yet deep learning in production still has a long way to go to catch up to the quality standards attained by more conventional software development. Here is one take on what is broken:

Data is not treated as a first-class citizen

In traditional software development, code is (rightly so) **a first-class citizen**. In ML development, there is a further need for data to be a first-class citizen as well. Therefore, data has to be treated with the same care that most developers give to the code they write.

Right now in most organizations, data is spread everywhere and inaccessible. This is not just about raw data either: even if an organization spends a lot of money centralizing its data into lakes, critical data is spread across the organization in Colab and Jupyter notebooks, scripts, and pre-processed flat files. This causes, amongst other things:

  • Wasted compute on redundant transformations of data (see the sketch after this list)
  • No transparency or accountability about which data trains which models
  • Inability to carry important training-phase transformations over to the serving phase (see below)
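
To make the first two points concrete, here is a minimal sketch of treating a transformed dataset as a versioned, cached artifact. This is illustrative only (not ZenML's actual implementation); the `cached_transform` helper and the cache layout are made up for the example:

```python
import hashlib
import json
import pathlib

import pandas as pd

CACHE_DIR = pathlib.Path("artifact_cache")
CACHE_DIR.mkdir(exist_ok=True)


def fingerprint(path: str) -> str:
    """Content hash of a raw data file, used as a version identifier."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()[:16]


def cached_transform(raw_path: str, transform, name: str) -> pd.DataFrame:
    """Run `transform` only if this exact input version has not been
    processed before; otherwise reuse the cached artifact."""
    version = fingerprint(raw_path)
    artifact = CACHE_DIR / f"{name}-{version}.parquet"
    if artifact.exists():
        return pd.read_parquet(artifact)  # no redundant recomputation
    df = transform(pd.read_csv(raw_path))
    df.to_parquet(artifact)
    # Record lineage: which input version produced which artifact.
    (CACHE_DIR / f"{name}-{version}.lineage.json").write_text(
        json.dumps({"input": raw_path, "input_version": version, "step": name})
    )
    return df
```

With something along these lines, re-running a step like `cached_transform("users.csv", clean_users, "clean_users")` (where `clean_users` is a hypothetical cleaning function) is cheap, and the lineage records answer "which data version trained this model?".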

Different requirements in training and serving

Teams often find it surprising when a well-trained model starts to give spurious results in the real world. The transition from training a model to serving it is far from trivial.

First, there is often skew between training and production data that needs to be taken into account. Secondly, one has to be very careful to make sure that production data goes through exactly the same preprocessing steps in serving as in training. Lastly, while training involves running experiments and iterating quickly, serving adds further requirements at the application level, e.g. inference latency and cost at scale. All of these need to be taken into account to avoid unpleasant surprises when the transition from training to serving happens.
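
One common way to avoid preprocessing skew is to serialize the preprocessing and the model as a single artifact, so serving cannot drift away from training. A minimal sketch using scikit-learn (one possible approach, with synthetic stand-in data, not the only way to do this):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative training data; in practice this comes from your data pipeline.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] > 0).astype(int)

# Training: preprocessing and model live in ONE serialized artifact.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")

# Serving: reload the same artifact; raw inputs pass through exactly
# the preprocessing steps that were fit at training time.
served = joblib.load("model.joblib")
print(served.predict(rng.normal(size=(5, 4))))
```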

No gold standard yet for MLOps

Applying DevOps principles to ML development (or MLOps) is all the rage right now. However, there is no gold standard for it yet. The field, still in its infancy, needs to tackle:

  • Resources (compute, GPUs, etc.) are scattered and not used efficiently across teams
  • No proper CI/CD pipelines
  • No proper monitoring in production (changes in data quality etc.; see the drift-check sketch after this list)
  • Scaling is hard, both in training and in serving
  • Machine learning compute comes in spikes, so systems need to be equipped to deal with that
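
As a concrete example of the monitoring point, here is a minimal drift check. It assumes a two-sample Kolmogorov-Smirnov test is an appropriate signal for the feature in question; the data and threshold are illustrative:

```python
import logging

import numpy as np
from scipy.stats import ks_2samp


def drifted(train_col: np.ndarray, live_col: np.ndarray,
            p_threshold: float = 0.01) -> bool:
    """Flag drift when a two-sample KS test rejects the hypothesis that
    training-time and production feature values share a distribution."""
    _, p_value = ks_2samp(train_col, live_col)
    return p_value < p_threshold


# Illustrative data: production values have shifted upward.
train_values = np.random.normal(0.0, 1.0, size=5_000)
live_values = np.random.normal(0.5, 1.0, size=1_000)

if drifted(train_values, live_values):
    logging.warning("Feature distribution shifted; consider retraining.")
```

In practice a check like this would run on a schedule over a sliding window of recent production requests, per feature.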

Collaboration is hard

In conventional software development, we use workflows that integrate tickets and version control to make collaboration as seamless and transparent as possible. Unfortunately, ML development still lags behind on this front. This is largely because ML developers tend to create silos made up of glue-code scripts, preprocessed data pickles, and Jupyter notebooks. While all of these are useful for research and experimentation, they do not translate well into a robust, long-running production environment.

In short, the ML world largely suffers from:

  • Opaque, siloed individual experimentation
  • Notebook hell with glue-code scripts
  • No versioning of data, code, or configuration (see the manifest sketch after this list)
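
As a sketch of what versioning data, code, and configuration together can look like, the snippet below pins all three into one record per training run. The manifest format and helper name are made up for illustration, and it assumes the code lives in a git repository:

```python
import hashlib
import json
import subprocess
from pathlib import Path


def run_manifest(data_path: str, config: dict) -> dict:
    """Pin together the three things that define a training run:
    the data version, the code version, and the configuration."""
    return {
        "data_sha256": hashlib.sha256(Path(data_path).read_bytes()).hexdigest(),
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "config": config,
    }


manifest = run_manifest("train.csv", {"lr": 3e-4, "epochs": 10})
Path("runs").mkdir(exist_ok=True)
Path("runs/manifest.json").write_text(json.dumps(manifest, indent=2))
```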

Conclusion

Most of the problems highlighted above can be solved by paying proper attention to machine learning development in production, from the first training run onwards. The field is catching up, slowly but surely, and it seems inevitable that machine learning development will eventually match traditional software engineering. Will we see new, ever-improving, and exciting ML products in our lives at that point? Let's hope so!

Our attempt to solve these problems is ZenML, an extensible, open-source MLOps framework. We recently launched and are now looking for practitioners who want to solve their problems in production use-cases! So head over to GitHub, and don't forget to leave us a star if you like what you see!
