Read time: 5 minutes
Kaggle datasets and notebooks are a great way to start in Machine Learning. There you learn things like:
data preparation
feature engineering
model training
model hyperparameter tuning
All these steps are necessary to build good ML models. However, in a real-world ML project there are a few extra things you need to do that you won't learn on Kaggle.
Here are the 3 most recurring hidden problems I have faced in my ML career, and my tips for dealing with them.
Tip #1. Understand the business problem first, then frame it as an ML problem
Otherwise, you might end up building a great ML model in a Jupyter notebook that does NOT move the business metric it was intended to move.
This is what I call "the perfect solution for the wrong problem".
My advice
I always ask 3 questions at the beginning of every project:
1. What is the business outcome that management wants to improve?
It is crucial that you talk with all relevant stakeholders at the beginning of the project. They have more business context than you and can help you understand the target you need to shoot at.
2. Is there any solution currently working in production to solve this, like some rule-based heuristics?
If there is one, this is the benchmark you have to beat in order to have a business impact. Otherwise, you can get a quick win by implementing a non-ML solution. (There is a small sketch of comparing such a baseline against a model right after this list.)
3. Is the model going to be used as a black box or as a tool to help humans make better decisions?
Creating black-box solutions is easier than building explainable ones. If you work in healthcare, for example, you need explainability. If you work in financial trading, you don't.
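Here is a minimal sketch of what "beating the benchmark" looks like in practice. Everything specific in it is hypothetical: the churned label, the days_since_last_login column and the 30-day rule are made up for illustration. The point is simply to score the existing heuristic and the candidate model on the same holdout set:

# Hypothetical sketch: compare the production rule against an ML model
# on the same holdout set, using the heuristic as the benchmark to beat.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train = pd.read_csv("data/train.csv")         # assumed to contain numeric features
holdout = pd.read_csv("data/validation.csv")  # plus a binary "churned" label

X_train, y_train = train.drop(columns=["churned"]), train["churned"]
X_val, y_val = holdout.drop(columns=["churned"]), holdout["churned"]

# The rule currently running in production: "a user churns if inactive > 30 days".
rule_preds = (X_val["days_since_last_login"] > 30).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_preds = model.predict(X_val)

print(f"rule-based baseline F1: {f1_score(y_val, rule_preds):.3f}")
print(f"ML model F1:            {f1_score(y_val, model_preds):.3f}")

If the model does not clearly beat the rule, you have learned something important before spending a single day on deployment.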
If you can answer these 3 questions, you know WHAT Machine Learning problem you need to solve.
And that is a fantastic starting point for the project.
Tip #2. Focus on getting more and better data
In Kaggle competitions you are given a fixed and static dataset. In fact, all participants use the same data and compete against each other on who has the better model. The focus is on models, not on the data.
In reality, the exact opposite happens.
Machine Learning models are a combination of software (e.g. from a simple logistic regression all the way to a colossal Transformer) and DATA (capital letters, yes). Data is what makes projects successful or not, not models.
But, how do you get more and better data?
My advice
These are the 2 things I recommend you do:
1. Talk (a lot) with the data engineers.
They know where each bit of data is. They can help you fetch it and use it to generate useful features for your model. They can also build pipelines to bring in 3rd-party data that can help you.
2. Become fluent in SQL.
SQL is the most universal language for accessing data, so you need to be fluent in it. This is especially true if you work in a less data-mature environment, like a startup. Knowing SQL lets you quickly build the training data for your models, as in the sketch below.
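To make this concrete, here is a rough sketch of the kind of query I mean. The tables and columns (users, orders, churned) are made up for illustration, and sqlite3 stands in for whatever client your warehouse uses (BigQuery, Snowflake, Postgres, ...):

# Hypothetical sketch: build a training set with plain SQL and dump it to CSV.
# Table names, columns and the 90-day window are illustrative, not real.
import sqlite3
import pandas as pd

QUERY = """
SELECT
    u.user_id,
    COUNT(o.order_id) AS n_orders_last_90d,
    MAX(o.created_at) AS last_order_at,
    u.churned         AS churned
FROM users u
LEFT JOIN orders o
    ON o.user_id = u.user_id
   AND o.created_at >= DATE('now', '-90 days')
GROUP BY u.user_id, u.churned;
"""

conn = sqlite3.connect("warehouse.db")  # swap for your warehouse's connector
train_df = pd.read_sql_query(QUERY, conn)
conn.close()

train_df.to_csv("data/train.csv", index=False)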
Tip #3. Structure your code well
Jupyter notebooks are great for quickly prototyping and testing ideas. Python is a language designed for fast iterations, and Jupyter notebooks are the perfect match. However, notebooks quickly get crowded and unmanageable.
It is best practice to structure your Python code as a package and avoid code duplication.
My advice
Python Poetry is my favorite packaging tool. With just 3 commands you can generate most of the scaffolding you need.
$ poetry new my-ml-package --name src
$ cd my-ml-package && poetry install
Poetry installs your local package as an editable Python dependency in your virtual environment, which means you can define Python functions inside src/my_file.py and import them wherever you need them, for example in a notebook, by running:
from src.my_file import my_function
At the end of the day, every ML project I work on has a structure like this:
my-ml-package
├── README.md
├── data
│   ├── test.csv
│   ├── train.csv
│   └── validation.csv
├── models
├── notebooks
│   └── my_notebook.ipynb
├── poetry.lock
├── pyproject.toml
├── queries
└── src
    ├── __init__.py
    ├── inference.py
    ├── data.py
    ├── features.py
    └── train.py
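To make the structure concrete, here is a hypothetical sketch of how src/train.py could glue the other modules together. The helper names (load_training_data, add_features) are invented for illustration and are not generated by Poetry:

# src/train.py -- a sketch of how the modules in src/ could fit together.
# load_training_data and add_features are hypothetical helpers living in
# src/data.py and src/features.py respectively.
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

from src.data import load_training_data   # e.g. runs a query or reads data/train.csv
from src.features import add_features     # e.g. the feature engineering steps


def train(model_dir: Path = Path("models")) -> None:
    """Train a model and save it under models/."""
    X, y = load_training_data()
    X = add_features(X)

    model = LogisticRegression(max_iter=1000).fit(X, y)

    model_dir.mkdir(exist_ok=True)
    joblib.dump(model, model_dir / "model.joblib")


if __name__ == "__main__":
    train()

The same pattern applies to inference.py: import the same feature code, load the saved model, and predict, so the notebook and the production code never drift apart.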
That's all for today, folks.
I hope you learned something new.
Enjoy your day!
And see you next week.
Whenever you're ready, there are 2 ways I can help you:
If you are still struggling to understand how to build, deploy and monitor a complete ML system, I'd recommend starting with a hands-on tutorial.
→ The Real-World ML Tutorial shows, from A to Z, how to build, deploy and monitor a real-world ML product. You will also join my private Discord server, so you can ask me questions and connect with other students. Join here.
Do you need help building an ML project or landing your first ML job?
→ Book a 1-on-1 session with me and get a personalized action plan. Book your slot here.
Hello Pau,
Thanks for this read. It's an amazing approach to building ML products. The conversations with the stakeholders are essential (totally agree); sometimes it takes more time to catch certain details of the business problem.
I want to ask you two questions:
1. Do you recommend using a virtual environment in Python with Poetry as well? (I only know this tool for managing requirements.) Most of the time you can create a Jupyter notebook in Anaconda without a virtual env and that's all.
2. What's the final product for our customer? (Could it be the Jupyter notebook, or the deployment?)
Thanks a lot.
Best regards,
Michael