Read time: 3 minutes
This is Part 2 of this mini-series on how to solve real-world business problems using Machine Learning.
Last week we covered Problem framing. Today, we will take the next step, and learn how to prepare the data.
Remember
The 4 steps to building a real-world ML product are
Problem framing (last week)
Data preparation (today)
Model training (next week)
MLOps
Example
Imagine you work at a ride-sharing app company in NYC as an ML engineer. And you want to help the operations team allocate the fleet of drivers optimally each hour of the day. The end goal is to maximize revenue.
Last week you learned how to frame this business problem as an ML problem.
Problem framing
We will build a predictive model for taxi demand. The model will predict how many rides will be requested
in each area of NYC
in the following 60 minutes
Before we can start building any ML model, we need to prepare the data.
Step 2. Data preparation
In real-world ML projects there is no Kaggle-like dataset with N columns for the features and 1 with the target. Instead, you have to create this dataset yourself, starting with raw data.
In this case, you have the list of taxi rides that have happened in NYC in the last 24 months, including
the date and time of the ride, and
the pickup location
This data is collected by the application backend and sent to data storage (aka data warehouse), where you can read it and use it for your ML service.
However, before doing so, you need to pre-process it by following these 3 steps:
Data validation.
Remove wrong or buggy ride events. For example, test events generated by your development team that accidentally ended up in the production data.
Aggregation of events into time-series data.
Your model will use historical one-hour data intervals.
Transformation of time-series data into pairs (features, target).
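Here is a minimal sketch of the first two steps (validation and aggregation), using only the Python standard library. The field names `pickup_datetime`, `pickup_location_id` and `is_test` are assumptions made for illustration; your raw data will likely use different names and need different validation rules.

```python
from collections import Counter
from datetime import datetime, timedelta


def validate_rides(rides, start, end):
    """Keep rides with a known location and a pickup time inside [start, end).

    `is_test` is a hypothetical flag marking events generated by the dev team.
    """
    return [
        r for r in rides
        if r.get("pickup_location_id") is not None
        and r.get("pickup_datetime") is not None
        and start <= r["pickup_datetime"] < end
        and not r.get("is_test", False)
    ]


def aggregate_hourly(rides, start, end):
    """Count rides per (location, hour). Hours without rides get a 0.

    Assumes `start` and `end` are aligned to the hour. Locations with no
    valid rides at all do not appear in the result.
    """
    counts = Counter(
        (
            r["pickup_location_id"],
            r["pickup_datetime"].replace(minute=0, second=0, microsecond=0),
        )
        for r in rides
    )
    hours = []
    hour = start
    while hour < end:
        hours.append(hour)
        hour += timedelta(hours=1)
    locations = sorted({location for location, _ in counts})
    return {loc: [counts[(loc, h)] for h in hours] for loc in locations}


# Toy example: 6 raw events, 2 of them invalid
rides = [
    {"pickup_datetime": datetime(2023, 1, 1, 0, 15), "pickup_location_id": 1},
    {"pickup_datetime": datetime(2023, 1, 1, 0, 40), "pickup_location_id": 1},
    {"pickup_datetime": datetime(2023, 1, 1, 2, 5), "pickup_location_id": 1},
    {"pickup_datetime": datetime(2023, 1, 1, 1, 30), "pickup_location_id": 2},
    # test event that leaked into production data
    {"pickup_datetime": datetime(2023, 1, 1, 1, 45), "pickup_location_id": 2, "is_test": True},
    # timestamp outside the period we care about
    {"pickup_datetime": datetime(1999, 1, 1, 0, 0), "pickup_location_id": 1},
]
start, end = datetime(2023, 1, 1, 0), datetime(2023, 1, 1, 3)
hourly = aggregate_hourly(validate_rides(rides, start, end), start, end)
print(hourly)  # {1: [2, 0, 1], 2: [0, 1, 0]}
```

Filling the empty hours with 0 matters: an hour with no rides is still a data point, and the model should see it.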
How to transform time-series data into Supervised ML data?
Most Supervised ML models (e.g. XGBoost) do not work directly with time-series data. Instead, you need to transform the time-series data into pairs (features, target), where
features are the model inputs
target is the model output
To transform time-series data into (features, target) pairs you define
a window length (e.g. last 12 hours) for the size of the input feature vector.
a step size (e.g. 1 hour), to control the total number of samples.
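and you slide this window over the time series. At each position, the values inside the window become the features, and the value for the following hour becomes the target.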
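The window-and-step idea above can be sketched in a few lines of plain Python. The function name `make_features_and_targets` is made up for this example, and it assumes the target is the single value right after each window (the next hour).

```python
def make_features_and_targets(series, window_length, step_size=1):
    """Turn a list of hourly counts into (features, target) pairs.

    Each feature vector holds `window_length` consecutive values,
    and its target is the value that comes right after them.
    """
    features, targets = [], []
    # Stop early enough that every window still has a next value to predict
    for i in range(0, len(series) - window_length, step_size):
        features.append(series[i:i + window_length])
        targets.append(series[i + window_length])
    return features, targets


hourly_rides = [10, 12, 9, 15, 20, 18]
features, targets = make_features_and_targets(hourly_rides, window_length=3, step_size=1)
print(features)  # [[10, 12, 9], [12, 9, 15], [9, 15, 20]]
print(targets)   # [15, 20, 18]
```

With `step_size=2` the same series yields only 2 samples instead of 3, which is how the step size controls the size of your training set.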
My advice
It is best to package all the data preprocessing steps into a function that you can run from the command line and later reuse as your feature pipeline (more on this in 2 weeks).
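As a rough sketch of what such a function could look like, here is one entry point that reads raw rides from a CSV file and writes (features, target) rows to another CSV file. Everything specific here is an assumption for illustration: the `pickup_datetime` column name, the `--input`, `--output` and `--window-length` flags, and the CSV format. To keep it short, it treats all of NYC as a single area and uses a step size of 1 hour.

```python
import argparse
import csv
from collections import Counter
from datetime import datetime, timedelta


def preprocess(input_path, output_path, window_length):
    """Raw ride events (CSV) -> one (features..., target) row per window (CSV)."""
    # 1. Validation: skip rows whose timestamp is missing or cannot be parsed
    hours = []
    with open(input_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                ts = datetime.fromisoformat(row["pickup_datetime"])
            except (KeyError, TypeError, ValueError):
                continue
            hours.append(ts.replace(minute=0, second=0, microsecond=0))
    if not hours:
        raise ValueError("no valid rides found")

    # 2. Aggregation: rides per hour, with 0 for hours that had no rides
    counts = Counter(hours)
    first_hour = min(hours)
    n_hours = (max(hours) - first_hour) // timedelta(hours=1) + 1
    series = [counts[first_hour + timedelta(hours=i)] for i in range(n_hours)]

    # 3. Transformation: sliding windows of past counts plus the next hour
    header = [f"rides_t-{k}" for k in range(window_length, 0, -1)] + ["target"]
    n_rows = 0
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for i in range(len(series) - window_length):
            writer.writerow(series[i:i + window_length + 1])
            n_rows += 1
    return n_rows


def main(argv=None):
    parser = argparse.ArgumentParser(description="Build training data from raw rides.")
    parser.add_argument("--input", required=True, help="CSV file with raw rides")
    parser.add_argument("--output", required=True, help="CSV file to write")
    parser.add_argument("--window-length", type=int, default=12)
    args = parser.parse_args(argv)
    n_rows = preprocess(args.input, args.output, args.window_length)
    print(f"wrote {n_rows} rows to {args.output}")
```

To run it from the command line, save it in a script (say `preprocess.py`, a hypothetical name), call `main()` under the usual `if __name__ == "__main__":` guard, and run `python preprocess.py --input rides.csv --output training_data.csv`. Because `preprocess` is a plain function, you can also import it and call it from other code later.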
Next steps
So far we have
defined the ML problem to solve and
prepared our training data.
Next week, we will move on to step 3, aka model training.
Wanna design, develop and deploy this ML system yourself?
Join the Real-World ML Tutorial + Community and get LIFETIME ACCESS to
3 hours of video lectures
Full source code implementation
Discord private community, to connect with me and 100+ students
Have a great weekend
And keep on learning!
Pau