Reliably Serving Machine Learning Models

Tame the magic black box of Machine Learning

Posted by Sam Lacey on July 24, 2018 Reading Time: 12m
Core of the Crab Nebula (Photo: NASA, ESA, and STScI)

Machine learning has taken the world by storm recently with the promise of solving extremely difficult software problems such as facial recognition or medical diagnosis. We achieve this by using huge quantities of data to train very complex deep artificial neural networks, which produces something akin to a magic black box. Nobody knows exactly how these magic black boxes work, but they can produce frighteningly accurate predictions or analyses. It seems that practically every tech organisation on the planet is currently looking into this field in some capacity. However, getting machine learning algorithms into a production environment can pose a series of interesting problems that, if not handled correctly, will lead to poor performance and unmaintainable software.

This blog post outlines a few strategies for integrating machine learning into software applications that are simple to implement, will eliminate common inefficiencies and help you tame that magic black box.

1 - Semantically version your models

Just like software, machine learning models will change over time. Perhaps new data becomes available and is used for more robust training, or your team discovers that the models produce unreliable results, suggesting that they’ve overfit during training. Each model your team produces can behave wildly differently from the others, and this can be rather difficult to keep track of, especially if you have numerous models.

Again, just like software, this is where semantic versioning comes in handy. If you take a little bit of time to version each release of your models, accompanied by a short description of what’s changed, you’ll know exactly which versions work well and what to roll back to if there’s a problem. Furthermore, as your model changes over time, the way you need to preprocess input data may also change, which will require software changes in your services. This means that different model versions will depend on particular versions of your software and libraries, which can quickly become a tangled mess if you don’t document and version them.
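As a rough illustration of the idea, a release record can carry the model’s own version, a short changelog and the oldest service version it is known to work with. This is only a minimal sketch: the names `ModelRelease`, `min_service_version` and `face-detector`, and the rule that a model only supports services on the same major version, are assumptions made for the example rather than anything prescribed here.

```python
from dataclasses import dataclass
from typing import Tuple


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


@dataclass(frozen=True)
class ModelRelease:
    """Hypothetical record published alongside each set of exported weights."""

    name: str
    version: str              # semantic version of the model itself
    changelog: str            # short description of what changed
    min_service_version: str  # oldest service release whose preprocessing matches

    def supports(self, service_version: str) -> bool:
        """True if the service is on the same major line and at least as new as required."""
        required = parse_semver(self.min_service_version)
        running = parse_semver(service_version)
        return running[0] == required[0] and running >= required


release = ModelRelease(
    name="face-detector",
    version="2.1.0",
    changelog="Retrained on additional data; input preprocessing changed.",
    min_service_version="1.4.0",
)
```

A service could call `release.supports(...)` with its own version at start-up and refuse to load a model it was never tested against.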

2 - Don’t bake your models into Docker images

Depending on their complexity, the exported weights of machine learning models can end up being several hundred megabytes in size or more. Furthermore, it’s probable that over time these models will keep growing as your machine learning team makes them more robust and better able to handle various edge cases, which usually takes precedence over compressing them. Assuming you use immutable infrastructure techniques, you’ll eventually need to get these models into your service’s container so the software you’ve built can actually utilise them to perform whatever operations you’ve cooked up.

If you bake the models into your Docker images during a build, then whenever your machine learning team releases an upgraded version you’ll have to go through the rigmarole of running your entire build process, even if no software, libraries or parameters need to change. This might become particularly problematic if you need to continually mess about with your git branching model to accommodate the new builds, especially if your master and develop branches are out of sync. Even if you don’t have this problem and your software build process is completely automated, iterating through all the steps still takes a while and overcomplicates what should be a very simple release.

I’ve found a good solution is to store machine learning models in bucket storage, supplied by a cloud service provider, from which your services can download specific versions on demand. This means that if your machine learning team releases a new model and no software changes are necessary, all you have to do is have your existing services download the new version, and your container platform (e.g. Kubernetes) won’t have to spend time downloading an entire ML Docker image. Furthermore, you have the peace of mind of knowing that the software in that image has already been running in production for some time without major issues.
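A minimal sketch of the download step might look like the following. The key layout (`models/<name>/<version>/weights.bin`), the function names and the local cache are all assumptions for illustration. The `client` argument is anything exposing `download_file(bucket, key, filename)`, which is the shape of the boto3 S3 client method of that name; other providers’ SDKs would need a thin adapter.

```python
import os


def model_key(name: str, version: str) -> str:
    """Object key for one model release (hypothetical layout)."""
    return f"models/{name}/{version}/weights.bin"


def fetch_model(client, bucket: str, name: str, version: str, cache_dir: str) -> str:
    """Download a specific model version unless it is already cached locally.

    Returns the local path of the weights file.
    """
    target = os.path.join(cache_dir, name, version, "weights.bin")
    if os.path.exists(target):
        return target  # this exact version was fetched earlier

    os.makedirs(os.path.dirname(target), exist_ok=True)
    # Download to a temporary name first so a partial download
    # is never mistaken for a complete model.
    partial = target + ".part"
    client.download_file(bucket, model_key(name, version), partial)
    os.replace(partial, target)
    return target
```

With boto3 installed, `fetch_model(boto3.client("s3"), ...)` would be one way to call it, with the version read from configuration so that rolling out a new model only means changing that value.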

3 - Use and abuse compiler optimisations

Machine learning can do amazing, crazy things and is quite rightly being invested in heavily around the globe. However, the models themselves can come with some pretty severe performance limitations when run in production. This is probably due to you switching from the GPU in that beast of a machine under your desk to a run-of-the-mill CPU in the cloud. This is completely understandable: cloud CPUs are quite cheap, whilst their GPU counterparts are almost prohibitively expensive for most SaaS models. Don’t get me wrong, a cloud CPU will do an admirable job, but it will probably leave a lot to be desired. It’s a CPU after all, not a miracle worker!

Luckily, if you are using a major machine learning library like TensorFlow, you may have noticed that it has various compiler optimisations available that can target specific architectures if you take the time to build from source. If you dig a little into the pip or precompiled releases, you’ll find that they tend to have been produced targeting the largest range of hardware possible to make the platform more accessible. Therefore, if you know the architecture of the CPUs behind the instance types you use on your cloud platform, you can specifically target them in your production builds. Although incorporating these optimisations into your build might take a significant amount of work, time and experimentation, the payoff is huge.

Additionally, if you use modern Intel CPUs in production, then you may also be able to make use of Intel’s Math Kernel Library (MKL) in conjunction with your machine learning library. MKL contains additional optimisations for mathematical operations. Together, these two sets of optimisations can lead to dramatic performance increases for machine learning models running on CPUs. I’d highly recommend taking the time to investigate and implement these optimisations for your software.

By taking the time to compile optimised libraries for your chosen infrastructure you’ll not only benefit from the performance gains they bring in terms of time but you also may not have to scale as much, saving you some money too!
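Before building from source, it helps to know which instruction sets your production CPUs actually support. The sketch below assumes an x86 Linux instance, where `/proc/cpuinfo` has a `flags` line, and maps a handful of those flags to the matching GCC/Clang options. The function names and the selection of flags are illustrative assumptions, not a complete list, and the check should be run on the production instance type rather than on the build machine.

```python
from typing import List, Set

# Linux /proc/cpuinfo flag names mapped to the matching GCC/Clang options.
# Only a small, illustrative subset is listed here.
FLAG_TO_COPT = {
    "sse4_2": "-msse4.2",
    "avx": "-mavx",
    "avx2": "-mavx2",
    "fma": "-mfma",
    "avx512f": "-mavx512f",
}


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo content."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def compiler_options(cpuinfo_text: str) -> List[str]:
    """Compiler options for every known instruction set the CPU reports."""
    flags = cpu_flags(cpuinfo_text)
    return [option for flag, option in FLAG_TO_COPT.items() if flag in flags]


# Shortened, made-up sample of what /proc/cpuinfo contains.
sample = "processor\t: 0\nmodel name\t: Example CPU\nflags\t\t: fpu sse4_2 avx avx2 fma\n"
options = compiler_options(sample)
```

On a real instance you would pass the contents of `/proc/cpuinfo` instead of `sample`. The resulting options can then be handed to the library’s build, for example through Bazel’s `--copt` option when building TensorFlow from source.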

4 - Turtles all the way down

If you choose to embrace the incredible benefits of compiler optimisations, you’ll notice that build times in your CI will go through the roof and then some. Whilst it’s understandable that these take time (machine learning frameworks are extremely complex projects), you don’t really want to hang around all day waiting for your builds, especially if you need to deal with a production issue quickly.

Fortunately, the underlying machine learning libraries you are currently using probably don’t change that often and definitely don’t need to be recompiled for every build. If this is the case, then simply create a new Docker image and repository specifically to house the optimisations and base all of your services that use machine learning off this image. Expanding this idea further, you can also create a separate image for each time-consuming step, such as build dependencies or software libraries, and just stack each one on top of the last. This will allow you to prepare and use any optimisations offered, whether in the libraries or the application dependencies, whilst limiting your build time. If you need to change something in a layer, you’ll have to rebuild everything based on it too, but it’s better to do this once in a while than to continuously rebuild everything from scratch.

If you don’t want to maintain several container image repositories, just use tags instead! This approach means you can more efficiently use your build time and concentrate on verifying the quality of the software you’ve produced rather than building the underlying dependencies.
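To make the rebuild rule concrete, here is a small sketch that works out which images need rebuilding when one layer of the stack changes. The image names and the `PARENTS` mapping are entirely hypothetical; in practice the relationships live in the `FROM` lines of your Dockerfiles.

```python
from typing import Dict, List, Optional

# Hypothetical image stack: each image maps to the image it is built FROM.
# None marks an image built on an external base.
PARENTS: Dict[str, Optional[str]] = {
    "ml-base": None,                 # compiled, optimised ML libraries
    "ml-build-deps": "ml-base",      # build dependencies
    "ml-app-libs": "ml-build-deps",  # application libraries
    "face-service": "ml-app-libs",
    "ocr-service": "ml-app-libs",
}


def images_to_rebuild(changed: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the changed image followed by every image built on top of it.

    Images are listed breadth-first, so a parent always appears
    before the images that depend on it.
    """
    order = [changed]
    queue = [changed]
    while queue:
        current = queue.pop(0)
        for image, parent in parents.items():
            if parent == current:
                order.append(image)
                queue.append(image)
    return order


rebuild = images_to_rebuild("ml-build-deps", PARENTS)
```

Here `rebuild` contains `ml-build-deps` and the three images above it, while the expensive `ml-base` image with the compiled libraries is left untouched.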

5 - Remember, you’re an engineer

Machine learning can at times seem a bit overwhelming. Its underlying theory is quite maths-heavy and some implementations require non-trivial data processing, especially in areas such as computer vision. Furthermore, the models and initial research are usually produced by an entirely separate machine learning team, which is probably composed of people with substantial backgrounds in research and academia.

Research teams have different priorities and a different focus from engineering teams, which is particularly evident when looking at any code they produce. Research teams, funnily enough, are focused on R&D, which means they are probably not concerned with scale, reliability or modular code design. Good software engineers should be able to take the prototypes that research teams produce, remove unnecessary or overly complex segments and produce the production version.

As a software engineer, the problems you’ll face in integrating machine learning models into your services are fundamentally similar to those you’ve already seen throughout your career. They can range from refactoring code into something more maintainable to designing your application architecture and approach to limit processing time. By systematically breaking down each problem, you’ll start to recognise all-too-familiar patterns and then be able to leverage your software engineering experience to implement solutions throughout your infrastructure. Software engineering has a rich history with well-trodden solutions to hundreds of problems, and there is absolutely no reason why these can’t be used to assist us in building reliable and scalable machine learning based applications.


This post has outlined a few tips and strategies that are relatively simple to implement in any organisation experimenting with machine learning. Whilst machine learning can be a complex and sometimes confusing task to undertake, the software you build to train or use the models doesn’t have to be. Above all else, the fundamental idea I always try to keep in mind, whether it’s building software or interfacing with a machine learning model, is the Keep It Simple Stupid (KISS) principle. If you consistently keep your approach as simple and modularised as possible, you’ll always find it easier to expand and scale your applications in the future. Don’t overthink it!

Sam Lacey

Founder, CTO and CEO
Singularity Technologies