Read time: 4 minutes
Two weeks ago I showed you how to build a real-time feature pipeline that
ingested live raw trade data from the Coinbase websocket API,
generated OHLC (Open-High-Low-Close) data in 30-second intervals, and
plotted these final features on a public Streamlit app.
That initial project was not bad. However, it lacked one crucial element of any real ML system: storing the generated features in a place where our ML models can later retrieve and use them, both for training and inference.
And this is exactly what we will do today: add a Feature Store and deploy our code to an AWS EC2 instance.
Source code 💻
All the code is available in this GitHub repository.
Give it a ⭐ on GitHub if you find it useful 🤗
Feature Store
I chose Hopsworks Feature Store. Why?
because it is serverless, so we do not need to manage infrastructure, and
because it has a very generous free tier, with 25GB of free storage.
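Writing features to Hopsworks boils down to creating a feature group and inserting a DataFrame into it. Here is a minimal sketch of that step; the project name, API key, feature group name, and column names are all illustrative placeholders, not the repo's actual values:

```python
from datetime import datetime, timezone

import pandas as pd


def push_to_feature_store(df: pd.DataFrame) -> None:
    """Insert a DataFrame of OHLC candles into a Hopsworks feature group.

    Requires the `hopsworks` package and valid credentials; the project
    name and API key below are placeholders.
    """
    import hopsworks  # imported lazily so the rest of the module works offline

    project = hopsworks.login(
        project="YOUR_PROJECT_NAME",
        api_key_value="YOUR_API_KEY",
    )
    fs = project.get_feature_store()
    fg = fs.get_or_create_feature_group(
        name="ohlc_30_sec",
        version=1,
        primary_key=["product_id", "timestamp"],
        event_time="timestamp",
        online_enabled=True,
    )
    fg.insert(df)


# One 30-second OHLC candle, shaped like the features the dataflow emits
candles = pd.DataFrame([{
    "product_id": "BTC-USD",
    "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "open": 42000.0,
    "high": 42050.0,
    "low": 41980.0,
    "close": 42010.0,
}])
print(list(candles.columns))
```

Setting `online_enabled=True` keeps a low-latency copy of the features for inference, while the offline store keeps the history for training.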
Stream-processing engine
I again used Bytewax, a powerful Rust library with a clean Python API for stateful stream processing. In this file, you can find the dataflow definition, which
fetches raw trades from the Coinbase websocket API,
generates OHLC data every 30 seconds, and
saves it to the Hopsworks Feature Store.
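The OHLC step itself is just a fold over the trades that land in each window. Here is a self-contained sketch of that reducer (field names are illustrative; in the repo this logic is wired into Bytewax's windowing operators):

```python
def build_ohlc(trades: list[dict]) -> dict:
    """Collapse the trades of one 30-second window into a single OHLC candle.

    Each trade is expected to carry a "price" key, and `trades` must be in
    arrival order so the first/last prices become open/close.
    """
    prices = [t["price"] for t in trades]
    return {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
    }


trades = [{"price": p} for p in (42000.0, 42050.0, 41980.0, 42010.0)]
candle = build_ohlc(trades)
print(candle)
# {'open': 42000.0, 'high': 42050.0, 'low': 41980.0, 'close': 42010.0}
```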
Deployment
Bytewax has a command line tool called waxctl
that helps you deploy your dataflow on different compute engines, including:
Kubernetes clusters,
EC2 instances on AWS, and
VMs in GCP.
For this proof of concept, I chose a small AWS EC2 instance. You can find the exact deploy command in the Makefile of the project.
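If you prefer to skip the Makefile, the underlying waxctl invocations look roughly like this (dataflow path and names are illustrative; check `waxctl aws deploy --help` for the exact flags):

```shell
# Package and ship the dataflow to a small EC2 instance
waxctl aws deploy src/dataflow.py --name ohlc-dataflow

# Check what is running, and tear it down when you are done
waxctl aws ls
waxctl aws delete --name ohlc-dataflow --yes
```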
Tip 🧠
It is best practice to add a Makefile to your project to automate and simplify common tasks. For this project, I created one that does a few things:
$ make init
→ to set up the Python environment
$ make run
→ to run the dataflow locally
$ make deploy
→ to deploy the dataflow to AWS EC2
$ make delete
→ to free AWS resources
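A stripped-down version of such a Makefile could look like this; the tool choices (Poetry, the `src/dataflow.py` path, the dataflow name) are assumptions for illustration, and the repo's own Makefile is the reference:

```makefile
.PHONY: init run deploy delete

init:            ## set up the Python environment
	poetry install

run:             ## run the dataflow locally
	poetry run python -m bytewax.run src/dataflow.py

deploy:          ## deploy the dataflow to AWS EC2
	waxctl aws deploy src/dataflow.py --name ohlc-dataflow

delete:          ## free AWS resources
	waxctl aws delete --name ohlc-dataflow --yes
```

Declaring the targets as `.PHONY` tells Make they are commands, not files, so they run even if a file with the same name exists.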
My advice
Go check the GitHub repository, play with the code (e.g. change the OHLC frequency from 30 seconds to 1 minute) and deploy it yourself.
Or get your hands dirty by changing the data source (for example Alpaca instead of Coinbase).
Go build.
Keep on learning.
And have a fantastic day.
Pau
Whenever you're ready, there is one thing I can help you with:
→ The Real-World ML Tutorial: learn to design, build, deploy and monitor a batch-scoring system that predicts taxi demand in NYC.