Read time: 4 minutes
Two weeks ago I showed you how to build a real-time feature pipeline that
ingested live raw trade data from the Coinbase websocket API,
generated OHLC (Open-High-Low-Close) data in 30-second intervals, and
plotted these final features on a public Streamlit app.
That initial project was not bad. However, it lacked one crucial element of any real ML system: storing the generated features in a place where our ML models can later retrieve and use them, both for training and inference.
And this is exactly what we will do today: add a Feature Store and deploy our code to an AWS EC2 instance.
Source code 💻
All the code is available in this GitHub repository.
Give it a ⭐ on GitHub if you find it useful 🤗
Feature Store
I chose Hopsworks Feature Store. Why?
because it is serverless, so we do not need to manage infrastructure, and
because it has a very generous free tier, with 25GB of free storage.
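Writing features to Hopsworks boils down to creating a feature group and inserting a DataFrame into it. Here is a minimal sketch of that step; the project name, API key, feature group name, and column names are all illustrative placeholders, not the repo's actual values:

```python
from datetime import datetime, timezone

import pandas as pd


def push_to_feature_store(df: pd.DataFrame) -> None:
    """Insert a DataFrame of OHLC candles into a Hopsworks feature group.

    Requires the `hopsworks` package and valid credentials; the project
    name and API key below are placeholders.
    """
    import hopsworks  # imported lazily so the rest of the module works offline

    project = hopsworks.login(
        project="YOUR_PROJECT_NAME",
        api_key_value="YOUR_API_KEY",
    )
    fs = project.get_feature_store()
    fg = fs.get_or_create_feature_group(
        name="ohlc_30_sec",
        version=1,
        primary_key=["product_id", "timestamp"],
        event_time="timestamp",
        online_enabled=True,
    )
    fg.insert(df)


# One 30-second OHLC candle, shaped like the features the dataflow emits
candles = pd.DataFrame([{
    "product_id": "BTC-USD",
    "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "open": 42000.0,
    "high": 42050.0,
    "low": 41980.0,
    "close": 42010.0,
}])
print(list(candles.columns))
```

Setting `online_enabled=True` keeps a low-latency copy of the features for inference, while the offline store keeps the history for training.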
Stream-processing engine
I again used Bytewax, a powerful Rust library with a clean Python API for stateful stream processing. In this file, you can find the dataflow definition, which
fetches raw trades from the Coinbase websocket API,
generates OHLC data every 30 seconds, and
saves it to the Hopsworks Feature Store.
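The OHLC step itself is just a fold over the trades that land in each window. Here is a self-contained sketch of that reducer (field names are illustrative; in the repo this logic is wired into Bytewax's windowing operators):

```python
def build_ohlc(trades: list[dict]) -> dict:
    """Collapse the trades of one 30-second window into a single OHLC candle.

    Each trade is expected to carry a "price" key, and `trades` must be in
    arrival order so the first/last prices become open/close.
    """
    prices = [t["price"] for t in trades]
    return {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
    }


trades = [{"price": p} for p in (42000.0, 42050.0, 41980.0, 42010.0)]
candle = build_ohlc(trades)
print(candle)
# {'open': 42000.0, 'high': 42050.0, 'low': 41980.0, 'close': 42010.0}
```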
Deployment
Bytewax has a command line tool called waxctl
that helps you deploy your dataflow on different compute engines, including:
Kubernetes clusters,
EC2 instances on AWS, and
VMs in GCP.
For this proof of concept, I chose a small AWS EC2 instance. You can find the exact deploy command in the Makefile of the project.
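If you prefer to skip the Makefile, the underlying waxctl invocations look roughly like this (dataflow path and names are illustrative; check `waxctl aws deploy --help` for the exact flags):

```shell
# Package and ship the dataflow to a small EC2 instance
waxctl aws deploy src/dataflow.py --name ohlc-dataflow

# Check what is running, and tear it down when you are done
waxctl aws ls
waxctl aws delete --name ohlc-dataflow --yes
```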
Tip 🧠
It is best practice to add a Makefile to your project to automate and simplify common tasks. For this project, I created one that does a few things:
$ make init
→ to set up the Python environment
$ make run
→ to run the dataflow locally
$ make deploy
→ to deploy the dataflow to AWS EC2
$ make delete
→ to free AWS resources
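A stripped-down version of such a Makefile could look like this; the tool choices (Poetry, the `src/dataflow.py` path, the dataflow name) are assumptions for illustration, and the repo's own Makefile is the reference:

```makefile
.PHONY: init run deploy delete

init:            ## set up the Python environment
	poetry install

run:             ## run the dataflow locally
	poetry run python -m bytewax.run src/dataflow.py

deploy:          ## deploy the dataflow to AWS EC2
	waxctl aws deploy src/dataflow.py --name ohlc-dataflow

delete:          ## free AWS resources
	waxctl aws delete --name ohlc-dataflow --yes
```

Declaring the targets as `.PHONY` tells Make they are commands, not files, so they run even if a file with the same name exists.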
My advice
Go check the GitHub repository, play with the code (e.g. change the OHLC frequency from 30 seconds to 1 minute) and deploy it yourself.
Or get your hands dirty by changing the data source (for example Alpaca instead of Coinbase).
Go build.
Keep on learning.
And have a fantastic day.
Pau
Whenever you're ready, there is one thing I can help you with:
→ The Real-World ML Tutorial: learn to design, build, deploy and monitor a batch-scoring system that predicts taxi demand in NYC.