In real-world ML projects, training data does not magically fall from the sky, as it does in Kaggle competitions. Instead, you have to generate it yourself.
And the truth is, generating this training data takes way more time, effort and debugging than training the ML models later on.
Let me share with you a few tricks to generate good training data, so you can train great ML models, build great ML systems, and become a great ML engineer.
The problem 🤔
Let’s say you want to build a crypto price predictor, like the one we are building in the “Building a Real-Time ML System. Together” course.
Before you can train an ML model you need a dataset with historical features and targets. And to generate this dataset you need to build at least one feature pipeline.
What is a feature pipeline?
A feature pipeline is a computer program that transforms raw data into reusable Machine Learning model features.
For example 💡
You can build a real-time feature pipeline, that runs 24/7 and does 3 things:
Ingests raw trades from an external API, either live or in batches.
Transforms these trades into OHLC candles in real-time using a Python library like Quix Streams.
Saves these OHLC features into the Feature Store.
Each of these steps is implemented as an independent dockerized Python microservice, and data flows between them through a message broker like Apache Kafka/Redpanda/Google PubSub. This design makes the system production-ready and scalable from day 1.
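To make step 2 concrete, here is a minimal, dependency-free sketch of the transformation step: raw trades in, OHLC candles out. It assumes each trade is a dict with `price`, `volume` and `timestamp` (epoch seconds) keys — in the real pipeline a library like Quix Streams would do this windowing for you over a Kafka topic.

```python
from dataclasses import dataclass


@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float
    volume: float
    window_start: int  # epoch seconds, start of the time window


def trades_to_candles(trades: list[dict], window_sec: int = 60) -> list[Candle]:
    """Aggregate raw trades into OHLC candles, one per time window."""
    candles: dict[int, Candle] = {}
    for t in sorted(trades, key=lambda t: t["timestamp"]):
        # bucket the trade into its window
        start = (t["timestamp"] // window_sec) * window_sec
        c = candles.get(start)
        if c is None:
            # first trade in this window opens the candle
            candles[start] = Candle(
                open=t["price"], high=t["price"], low=t["price"],
                close=t["price"], volume=t["volume"], window_start=start,
            )
        else:
            c.high = max(c.high, t["price"])
            c.low = min(c.low, t["price"])
            c.close = t["price"]  # latest trade closes the candle
            c.volume += t["volume"]
    return [candles[k] for k in sorted(candles)]
```

This is exactly the piece of logic that, as we will see below, must stay identical whether the trades come from the live Websocket or from a historical backfill.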
Once you have the feature pipeline up and running, real-time features start flowing to your Feature Store.
However, there is still a problem…
Problem ❗
To train good ML models you need to have a significant amount of historical features in the store. Without enough historical data in your store, you cannot train a good predictive model.
You have 2 options to generate these historical features:
Wait a few days/weeks/months until your real-time feature pipeline generates enough data (what a bummer! 😞), or
Find another source of historical raw data (in our case trades), feed it to your feature pipeline and generate historical features. This operation is called feature backfilling, and it is definitely more appealing than just waiting 🤩
Let me show you how to do feature backfilling the right way.
Solution → Feature backfilling 🔙
To backfill historical features you first need access to raw historical data. In our case, Kraken offers both:
A Websocket API that serves trades in real-time, and
A REST API that serves historical trades for the time period we want.
Both data sources are consistent, meaning the trades that the websocket serves in real-time are the exact ones that are later on available as historical trades from the REST API.
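Backfilling from a REST API usually means paginating with a cursor until you reach the present. Here is a sketch of that loop. The `fetch` callable is a hypothetical wrapper around the historical trades endpoint (it returns a batch of trades plus a cursor for the next request) — injecting it keeps the pagination logic testable without hitting the network.

```python
from typing import Callable, Iterator


def backfill_trades(
    fetch: Callable[[int], tuple[list[dict], int]],
    since: int,
    until: int,
) -> Iterator[dict]:
    """Page through historical trades from `since` until `until` (epoch secs).

    `fetch(cursor)` is assumed to return (trades, next_cursor), where each
    trade is a dict with a 'timestamp' key.
    """
    cursor = since
    while True:
        trades, next_cursor = fetch(cursor)
        if not trades:
            return  # no more history available
        for t in trades:
            if t["timestamp"] >= until:
                return  # reached the end of the requested period
            yield t
        if next_cursor == cursor:
            return  # no progress: avoid an infinite loop
        cursor = next_cursor
```

The trades yielded here feed the exact same transformation step as the live Websocket trades do.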
Very important 🚨
All raw data (no matter whether it is live or historical) needs to be transformed into features using the exact same Python code.
Otherwise, the features that you will use to train your ML models might be slightly different than the ones you send to your model once deployed (the infamous training-serving skew), and your model won’t work as you expect!
So, instead of re-writing the entire pipeline, you need to adjust the code in 2 places:
The ingestion step, to connect either to the live Websocket API or the Historical REST API, and
The serialization step, to save live features to the online store, and historical features to the offline store.
You can switch these 2 components based on your script input parameters, for example:
To ingest either live or historical trades
To save features either to the online or the offline Feature Store
And remember, the transformation step is EXACTLY THE SAME, no matter if the feature pipeline runs with live data or historical data.
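The switch above can be sketched in a few lines. The ingestion functions, the stores and `to_features` are hypothetical stand-ins for the real microservices; the point is the shape of the wiring — one parameter flips ingestion and sink, while the transformation function is a single shared piece of code.

```python
# Hypothetical stand-ins for the real ingestion steps
def ingest_live() -> list[dict]:
    return [{"price": 100.0, "volume": 1.0, "timestamp": 0}]


def ingest_historical() -> list[dict]:
    return [{"price": 90.0, "volume": 2.0, "timestamp": 0}]


def to_features(trades: list[dict]) -> list[dict]:
    # The shared transformation step: identical for live and historical runs
    return [{"close": t["price"], **t} for t in trades]


# Stand-ins for the online and offline Feature Store
ONLINE_STORE: list = []
OFFLINE_STORE: list = []


def run_pipeline(mode: str) -> None:
    """Switch ingestion and sink on one input parameter; reuse the transform."""
    if mode == "live":
        trades, store = ingest_live(), ONLINE_STORE
    elif mode == "historical":
        trades, store = ingest_historical(), OFFLINE_STORE
    else:
        raise ValueError(f"unknown mode: {mode}")
    store.extend(to_features(trades))
```

Because `to_features` is the only transformation code path, there is no way for the backfilled features to drift from the live ones.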
BOOM!
Wanna learn more Data Management for ML in the real world? 🧠
Register for FREE for the upcoming Feature Store Summit 2024 on October 15th.
It is a fully online and FREE event where you will learn best practices for Data Management in ML from top companies like Uber, Airbnb, Stripe, Quix, Bytewax and Hopsworks, among others.
Talk to you next week,
Same place. Same time.
Pau