On Machine Learning in Production
There's a gap between what machine learning looks like in a tutorial and what it looks like at 2am when your model is returning nonsense and you have no idea why.
I've spent a lot of time on both sides. Here's what I've learned.
The Gap Is Real
Most ML content is about model accuracy. Getting from 92% to 94%. Trying a new architecture. Fine-tuning hyperparameters.
Production ML is almost never about that. It's about:
- Why did the model suddenly start behaving differently?
- What happened to the data pipeline upstream?
- Why does the model work fine in staging and break in prod?
The model is usually fine. The infrastructure around it is where things go wrong.
Data First, Always
Every bad model I've shipped had a data problem, not a model problem.
Before you spend time on architecture, spend time on your data:
- Understand the distribution. What does your training data actually look like? Plot everything.
- Write data tests. Test that your schema is what you think it is. Test that ranges are reasonable. Test for nulls.
- Log your inputs. If something breaks, you need to know what went in.
I can't tell you how many times I've traced a bug back to a data type mismatch or a silent schema change upstream.
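The data tests above can be sketched as a single validation function that runs before every training job or batch of predictions. This is a minimal illustration, not a real pipeline: the schema, column names, and range rules are hypothetical stand-ins for whatever your data actually looks like.

```python
import pandas as pd

# Hypothetical schema for illustration: column name -> expected dtype.
EXPECTED_SCHEMA = {"user_id": "int64", "amount": "float64", "country": "object"}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of data problems; an empty list means the batch looks sane."""
    problems = []
    # Schema test: every expected column is present with the expected dtype.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Range test: this example assumes amounts should never be negative.
    if "amount" in df.columns and (df["amount"] < 0).any():
        problems.append("amount: negative values found")
    # Null test on required columns.
    for col in EXPECTED_SCHEMA:
        if col in df.columns and df[col].isna().any():
            problems.append(f"{col}: nulls found")
    return problems
```

Returning a list of problems (rather than raising on the first one) means a single failed run tells you everything that's wrong with the batch, which matters at 2am.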
Versioning Everything
Models are artifacts. Treat them like code.
This means:
- Version your models explicitly (not model_final_v2_FINAL.pkl)
- Track which data version trained which model version
- Be able to roll back to a previous model in under five minutes
MLflow is fine for this. A simple naming convention in S3 is also fine. The tooling matters less than the discipline.
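As one possible shape for the "simple naming convention" approach, here is a sketch of a deterministic artifact key that bakes the data version and a hash of the training config into the path. The function name, path layout, and parameters are all hypothetical; the point is that the key alone answers "which data and which config produced this model?"

```python
import hashlib
import json
from datetime import datetime, timezone

def model_artifact_key(model_name: str, data_version: str, params: dict) -> str:
    """Build a sortable artifact key, e.g. an S3 object key.

    The timestamp makes keys sort chronologically, so "roll back" is just
    "point the serving config at the previous key".
    """
    # Hash the training params so two configs can never silently collide.
    param_hash = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:8]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"models/{model_name}/{stamp}_data-{data_version}_cfg-{param_hash}.pkl"
```

Because the keys sort by timestamp, rolling back is a one-line config change rather than a retraining job, which is what makes the five-minute target realistic.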
The Feedback Loop
A model that ships and gets forgotten will degrade. The world changes. Your model doesn't know that.
Build monitoring in from the start:
- Track prediction distributions over time
- Alert on distribution shift
- Have a scheduled retraining pipeline, even if you don't use it often
Most importantly: make it easy to retrain. If retraining requires a heroic effort, it won't happen until things are already broken.
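One common way to put a number on "distribution shift" is the population stability index (PSI), comparing live prediction distributions against the training-time reference. This is a sketch of that metric, not the post's own monitoring setup; the alert thresholds in the docstring are a widely used industry rule of thumb, not a universal law.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a reference (training) distribution and live values.

    Common heuristic: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
    """
    # Bin edges come from the reference distribution's quantiles, so each
    # reference bin holds roughly the same share of the data.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values
    eps = 1e-6  # avoid log(0) when a bin is empty
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Run this on a schedule over a sliding window of predictions and alert when it crosses your threshold; that is the "alert on distribution shift" bullet in about fifteen lines.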
Keep It Simple
The best model I've ever shipped was a gradient boosted tree on 12 features. It ran in under a millisecond, was easy to debug, and stayed working for two years with minimal maintenance.
The worst was a complex ensemble with a neural component that nobody on the team fully understood. It was marginally more accurate and cost 10x more in operational headache.
Reach for complexity only when simplicity has failed. It almost never needs to.