Read time: 5 minutes
99% of Machine Learning courses teach you how to build ML models using static datasets.
You are given one or several CSV files (or, even better, Parquet files), you explore the data, build your model… and that’s it.
Well, it is not.
The problem is that real-world ML services can only generate value once you plug them into live data sources. There is no static CSV file, only constantly flowing data that needs to be processed and fed into your ML model.
And this is precisely what a feature pipeline does.
The feature pipeline
The feature pipeline is a program that fetches, transforms, and stores ML-ready data (ideally in a Feature Store) so the rest of the system can use it.
Typical data sources a feature pipeline ingests from are:
A data warehouse, for example, Google BigQuery, where an enterprise typically stores product and user-related data in structured tables.
A Kafka topic that channels streaming data coming from live user interactions with your app. For example, Netflix sends all your activity on their website in real time, so their ML-based recommender system can suggest the best series for you to watch next.
An external API, for example, a WebSocket with live crypto prices that feeds a trading bot.
Depending on the frequency at which the feature pipeline runs, we can distinguish between two types:
Batch feature pipeline
Streaming feature pipeline
Let’s see what they are and how you can implement them.
Batch feature pipeline
A batch feature pipeline is a program, often written in Python or Spark (e.g. PySpark), that fetches data and generates features on a schedule, for example:
daily
hourly
every 10 minutes
To implement a feature pipeline you need two pieces of infrastructure (a minimal code sketch follows this list):
Compute, that is, a virtual machine where your Python or Spark code runs. For example:
a GitHub Actions virtual machine (a hosted runner), which is free, if you do not worry much about scale and want to quickly build a fully working MVP.
a Kubernetes cluster, if you work at a company that already has that infrastructure.
a Spark cluster, if you are processing a large amount of data.
or an AWS Lambda function, if you do not want to worry about infrastructure.
Orchestration, to schedule and trigger the execution of the pipeline. There are several options here, including:
GitHub Actions, which, again, is free.
Apache Airflow, the dominant orchestration tool in the data engineering space for the last decade. If you are not a big fan of managing infrastructure yourself (like me), I recommend you pick a managed Airflow service from a cloud provider, like AWS or GCP.
Prefect, a managed orchestration platform that is growing in popularity.
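To make this concrete, here is a minimal sketch of a batch feature pipeline in plain Python with pandas. The data source, the feature logic, and the Parquet sink are placeholders I made up for illustration; in a real pipeline you would read from your warehouse or API and write to your Feature Store.

```python
# Minimal batch feature pipeline sketch (illustrative, not production code).
# The raw data source and the feature-store sink are placeholders.
from datetime import datetime, timedelta

import pandas as pd


def fetch_raw_data(since: datetime) -> pd.DataFrame:
    """Placeholder: pull raw events (e.g. from a warehouse or API) since `since`."""
    return pd.DataFrame(
        {
            "user_id": [1, 1, 2],
            "amount": [10.0, 5.0, 7.0],
            "ts": [since, since + timedelta(hours=1), since + timedelta(hours=2)],
        }
    )


def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform raw events into ML-ready features, one row per user."""
    return (
        raw.groupby("user_id")
        .agg(total_spent=("amount", "sum"), n_events=("amount", "count"))
        .reset_index()
    )


def save_features(features: pd.DataFrame) -> None:
    """Placeholder: write features to your Feature Store (here, a Parquet file; needs pyarrow)."""
    features.to_parquet("features.parquet", index=False)


if __name__ == "__main__":
    # Run once per schedule tick, e.g. triggered hourly by GitHub Actions or Airflow.
    raw = fetch_raw_data(since=datetime.utcnow() - timedelta(hours=1))
    save_features(compute_features(raw))
```

Note that the orchestrator (GitHub Actions, Airflow, Prefect) only schedules and triggers this script; the pipeline code itself stays the same whichever one you pick.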
Streaming feature pipeline
A streaming feature pipeline is a program that constantly ingests data, processes it, and serves it downstream, either to a message bus (e.g. Apache Kafka) or a Feature Store.
Typical data sources are:
a WebSocket API, for example, the Coinbase API for real-time crypto prices (see the ingestion sketch after this list).
a message bus like Apache Kafka (open-source), Pub/Sub (Google), or Amazon Kinesis (AWS) that transports real-time user data from production apps into your backend services.
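As an illustration, here is a minimal sketch of ingesting live prices from a WebSocket API with Python’s `websockets` library. The endpoint URL and the subscribe message are placeholders, not a real provider’s API, so check your provider’s docs (e.g. Coinbase) for the actual endpoint and message schema.

```python
# Minimal streaming ingestion sketch using the `websockets` library
# (pip install websockets). The URL and message format are placeholders.
import asyncio
import json

import websockets

WS_URL = "wss://example.com/live-prices"  # placeholder endpoint


async def ingest() -> None:
    async with websockets.connect(WS_URL) as ws:
        # Many price feeds expect a subscribe message right after connecting.
        await ws.send(json.dumps({"type": "subscribe", "product_ids": ["BTC-USD"]}))
        async for message in ws:
            event = json.loads(message)
            # Here you would transform the event into features and push them
            # downstream, to Kafka or straight into a Feature Store.
            print(event)


if __name__ == "__main__":
    asyncio.run(ingest())
```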
Stream processing latency is often a critical parameter in your pipeline.
Example 💡
Imagine you are fetching real-time crypto prices, and generating features for your trading bot. A couple of seconds of delay can really make a difference in your trading profits.
Python, the language preferred by most data scientists and ML engineers, is a very slow language. Hence, stream processing tools are usually implemented in more efficient languages, like Java/Scala (on the JVM) or Rust.
These are some of the most popular stream processing engines (a toy Python sketch of the kind of feature they compute follows this list):
Apache Spark Streaming (Java/Scala)
Apache Flink (Java/Scala)
Bytewax, which is built on top of Rust (a highly performant language) and has a very expressive and Python-friendly API.
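To give you an intuition of what these engines compute, here is a toy Python sketch of a typical streaming feature: the average price over the last 60 seconds, kept up to date as new events arrive. A real deployment would express this as a windowed operator in Spark Streaming, Flink, or Bytewax, which add parallelism, fault tolerance, and low latency on top of the same idea.

```python
# Toy sketch of a stateful streaming feature: the average price over the
# last 60 seconds, updated on every incoming event.
from collections import deque

WINDOW_SECONDS = 60


class RollingAveragePrice:
    def __init__(self) -> None:
        self.events: deque[tuple[float, float]] = deque()  # (timestamp, price)

    def update(self, timestamp: float, price: float) -> float:
        self.events.append((timestamp, price))
        # Drop events that fell out of the 60-second window.
        while self.events and self.events[0][0] < timestamp - WINDOW_SECONDS:
            self.events.popleft()
        return sum(p for _, p in self.events) / len(self.events)


if __name__ == "__main__":
    feature = RollingAveragePrice()
    for ts, price in [(0, 100.0), (30, 102.0), (70, 101.0)]:
        print(ts, feature.update(ts, price))
```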
And here is the challenge 🚀
I am a firm believer that the only way to learn ML is to build ML.
Hence, I am challenging you to build a feature pipeline.
How?
Pick an API you are interested in from this list.
Build a feature pipeline, either batch or streaming. If this is your first pipeline, I suggest you choose batch processing and a free service like GitHub Actions to run and schedule it.
Save the generated features to a Feature Store, like Hopsworks (a minimal sketch follows this list).
(Optionally) Build a frontend dashboard with Streamlit to visualize them.
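For the “save to a Feature Store” step, here is a minimal sketch of writing features to Hopsworks with its Python client. I am writing it from memory, so treat the exact function names as assumptions and double-check them against the Hopsworks documentation.

```python
# Minimal sketch of saving features to Hopsworks (pip install hopsworks).
# The exact client calls may differ between versions; check the docs.
import hopsworks
import pandas as pd

features = pd.DataFrame({"user_id": [1, 2], "total_spent": [15.0, 7.0]})

project = hopsworks.login()  # reads your API key from the environment
fs = project.get_feature_store()

fg = fs.get_or_create_feature_group(
    name="user_spending",
    version=1,
    primary_key=["user_id"],
    description="Toy features for the challenge",
)
fg.insert(features)
```

Once the features are in the Feature Store, your Streamlit dashboard (or your model) can read them back through the same client.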
I will be more than happy to promote your work and give you visibility through my Twitter and LinkedIn accounts.
Keep on learning!
Pau