RWML #026: 4 steps to build real-world ML products

Part 2. Data preparation 📊

May 27, 2023

Read time: 3 minutes

This is Part 2 of this mini-series on how to solve real-world business problems using Machine Learning.

Last week we covered Problem framing. Today, we will take the next step, and learn how to prepare the data.

Remember 🙋

The 4 steps to building a real-world ML product are:
Problem framing (last week)
Data preparation (today) 📊
Model training (next week)
MLOps

Example

Imagine you work at a ride-sharing app company in NYC as an ML engineer. And you want to help the operations team allocate the fleet of drivers optimally each hour of the day. The end goal is to maximize revenue.

Last week you learned how to frame this business problem as an ML problem.

Problem framing 🖼️

We will build a predictive model for taxi demand. The model will predict how many rides will be requested
in each area of NYC
in the following 60 minutes

Before we can start building any ML model, we need to prepare the data.

Step 2. Data preparation

In real-world ML projects there is no Kaggle-like dataset with N columns for the features and 1 with the target. Instead, you have to create this dataset yourself, starting with raw data.

In this case, you have the list of taxi rides that have happened in NYC in the last 24 months, including

the date and time of the ride, and
the pickup location

This data is collected by the application backend and sent to data storage (aka data warehouse), where you can read it and use it for your ML service.

However, before doing so, you need to pre-process it by following these 3 steps:

Data validation.
Remove wrong or buggy ride events. For example, test events generated by your development team that accidentally ended up in the production data.
Aggregation of events into time-series data.
Count the rides requested in each area during every one-hour interval, because your model will use historical data at this granularity.
Transformation of time-series data into pairs (features, target).
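
Here is a minimal sketch of the first two steps using pandas. The column names `pickup_datetime` and `pickup_location_id` are assumptions (your warehouse schema will likely differ), and the validation rule (drop rows with missing values or with timestamps outside the expected date range) is only one illustrative check, not a complete filter for test events.

```python
import pandas as pd


def validate_rides(rides: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Keep rides with no missing fields and a pickup time inside [start, end)."""
    rides = rides.dropna(subset=["pickup_datetime", "pickup_location_id"])
    in_range = (rides["pickup_datetime"] >= pd.Timestamp(start)) & (
        rides["pickup_datetime"] < pd.Timestamp(end)
    )
    return rides[in_range]


def aggregate_hourly(rides: pd.DataFrame) -> pd.DataFrame:
    """Count rides per hour (rows) and per pickup location (columns)."""
    hourly = (
        rides.assign(pickup_hour=rides["pickup_datetime"].dt.floor("h"))
        .groupby(["pickup_hour", "pickup_location_id"])
        .size()
        .unstack(fill_value=0)
    )
    # Hours with no rides at all are missing after the groupby, so add them back as 0.
    full_index = pd.date_range(hourly.index.min(), hourly.index.max(), freq="h")
    return hourly.reindex(full_index, fill_value=0)
```

Filling the empty hours with zeros matters: an hour with no rides is a real observation of zero demand, and the model should see it.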

How to transform time-series data into Supervised ML data?

Most Supervised ML models (e.g. XGBoost) do not work directly with time-series data. Instead, you need to pre-process the time-series data into pairs (features, target), where

features are the model inputs
target is the model output

To transform time-series data into (features, target) pairs you define

a window length (e.g. last 12 hours) for the size of the input feature vector.
a step size (e.g. 1 hour), to control the total number of samples.

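and you slide this window over the time series. At each position, the values inside the window become the features, and the value for the hour that follows becomes the target.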
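
The window-and-step idea can be sketched with NumPy for the hourly ride counts of a single area. The function name `make_features_and_targets` and its parameters are illustrative choices, not a fixed API, and the sketch assumes the target is the ride count of the hour right after the window.

```python
import numpy as np


def make_features_and_targets(series, window_length: int = 12, step_size: int = 1):
    """Slide a window over a 1-D series of hourly ride counts.

    Each sample uses `window_length` consecutive hours as features and
    the hour immediately after the window as the target.
    """
    values = np.asarray(series)
    features, targets = [], []
    # Last start index that still leaves one value after the window for the target.
    last_start = len(values) - window_length - 1
    for start in range(0, last_start + 1, step_size):
        end = start + window_length
        features.append(values[start:end])
        targets.append(values[end])
    if not features:
        # Series too short to produce even one sample.
        return np.empty((0, window_length)), np.empty((0,))
    return np.array(features), np.array(targets)


X, y = make_features_and_targets([10, 20, 30, 40, 50], window_length=3, step_size=1)
# X is [[10, 20, 30], [20, 30, 40]] and y is [40, 50]
```

A larger step size produces fewer, less overlapping samples. A smaller one produces more samples from the same history.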

My advice 🧠

It is best to package all the data preprocessing steps into a function that you can run from the command line and later reuse as your feature pipeline (more on this in 2 weeks).
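
One possible way to do this is a small `argparse` wrapper around the pipeline function. Everything named here (`run_pipeline`, the `--input` and `--output` flags, the defaults) is hypothetical, and the body of `run_pipeline` is a placeholder where your validation, aggregation and transformation steps would go.

```python
import argparse


def parse_args(argv=None):
    """Read the pipeline settings from the command line."""
    parser = argparse.ArgumentParser(
        description="Build (features, target) training data from raw ride events."
    )
    parser.add_argument("--input", required=True, help="path to the raw ride events")
    parser.add_argument("--output", required=True, help="where to write the training data")
    parser.add_argument("--window-length", type=int, default=12, help="hours of history per sample")
    parser.add_argument("--step-size", type=int, default=1, help="hours between consecutive samples")
    return parser.parse_args(argv)


def run_pipeline(input_path, output_path, window_length, step_size):
    """Placeholder: validate, aggregate and transform the data here."""
    # Returning the settings keeps this sketch runnable without any data files.
    return {
        "input": input_path,
        "output": output_path,
        "window_length": window_length,
        "step_size": step_size,
    }


def main(argv=None):
    args = parse_args(argv)
    return run_pipeline(args.input, args.output, args.window_length, args.step_size)
```

If you save this in a script (say, a hypothetical `prepare_data.py`) and call `main()` under the usual `if __name__ == "__main__":` guard, you can run the whole preprocessing with a single command, and import the same `run_pipeline` function later from your feature pipeline.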

Next steps

So far we have

✅ defined the ML problem to solve and

✅ generated our training data.

Next week, we will move on to step 3, aka model training.

Wanna design, develop and deploy this ML system yourself?

Join the Real-World ML Tutorial + Community and get LIFETIME ACCESS to

→ 3 hours of video lectures 🎬
→ Full source code implementation 👨‍💻
→ Discord private community, to connect with me and 100+ students 👨‍👩‍👦

Have a great weekend

And keep on learning!

Pau

Real-World Machine Learning
