On Monday we had the first Monday’s Coffee at the Real World ML Community.
What is the Real World ML Community?
It is a private Discord community with almost 700 hungry ML/AI human builders. They are all human, because they all happen to be students from one or more of my courses.
And as far as I know, bots do not enrol in production-grade ML courses.
Every Monday we spend 60 minutes discussing one business/engineering topic, while enjoying a coffee ☕. And we always kick things off by asking one VERY SPECIFIC question.
Last Monday's question was:
How can I (or my company) deploy my own LLM server❓
And before we enter into the how, let me cover the why.
Why deploy an LLM server?
There are 2 main reasons why you might want to self-host LLM servers. The most important one is data privacy. The second one is cost optimisation.
Data privacy
Also called "stay away from my data, please"
Many companies don't want to send their data to a third-party server.
I have worked for several startups in the health tech space that will NEVER send their data to a third-party server, be it OpenAI, Anthropic or even the self-proclaimed privacy-focused AWS Bedrock.
Cost optimisation
If data privacy is not a deal-breaker for you or your company, you might consider using third-party LLM providers like OpenAI or Anthropic, and pay as you go (aka serverless). You only pay for the tokens you use.
This, in my experience, is a great option when you are building the first version of your product. However, if the product gets traction and the number of tokens you generate through third-party LLMs keeps growing, your bills will grow with it, and at some point self-hosting LLMs becomes more cost-effective.
In other words, pay-as-you-go cost grows linearly with your token usage, while self-hosted cost is more like a step-wise function, where each jump is the moment you need to upgrade the underlying hardware of your GPU instance (see the back-of-the-envelope sketch after the example below).
For example
It might be ok to start with a single GPU node. But as traffic increases, you might call the infra guy at your company (aka your Marius) and kindly ask him to either
upgrade this node (for example from an A6000 to an H100) → vertical scaling, or
add a second node → horizontal scaling.
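To make the trade-off concrete, here is a back-of-the-envelope sketch. All the prices and throughput numbers in it are hypothetical placeholders, not quotes from any provider; plug in your own numbers.

# break_even.py -- back-of-the-envelope cost comparison.
# All prices and throughput figures below are HYPOTHETICAL placeholders.
import math

API_PRICE_PER_1M_TOKENS = 10.0      # $ per 1M tokens, pay-as-you-go (hypothetical)
GPU_NODE_PRICE_PER_HOUR = 2.5       # $ per hour for one rented GPU node (hypothetical)
TOKENS_PER_NODE_PER_MONTH = 500e6   # tokens one node can serve per month (hypothetical)
HOURS_PER_MONTH = 730

def pay_as_you_go(tokens_per_month: float) -> float:
    """Grows linearly with token usage."""
    return tokens_per_month / 1e6 * API_PRICE_PER_1M_TOKENS

def self_hosted(tokens_per_month: float) -> float:
    """Step-wise: you pay per node, and add a node whenever traffic outgrows the current ones."""
    nodes = max(1, math.ceil(tokens_per_month / TOKENS_PER_NODE_PER_MONTH))
    return nodes * GPU_NODE_PRICE_PER_HOUR * HOURS_PER_MONTH

for tokens in (10e6, 100e6, 1000e6):
    print(f"{tokens / 1e6:>5.0f}M tokens/month -> "
          f"API: ${pay_as_you_go(tokens):>8,.0f}  vs  self-hosted: ${self_hosted(tokens):>8,.0f}")

With these made-up numbers the crossover sits somewhere between 100M and 1B tokens per month. With your real prices it will sit somewhere else, but the shape of the two curves is the point.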
So, having said all this, let's now move on to the how.
HOW to deploy an LLM server?
What follows are the main steps we discussed on Monday with Marius, Shihab, Peter and other members of the community.
This is the fastest way I know to get up and running with an LLM server that you can roll out and let your AI engineers test.
Tip 💡
If you want to let your AI engineering team experiment with different LLM providers (say, the one they currently use vs the one you are about to roll out), you can use LiteLLM to offer them a single LLM API gateway.
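Here is a minimal sketch of what that LiteLLM proxy config could look like. The model names, the API base URL and the placeholder API key are all assumptions; adapt them to your setup.

# config.yaml -- minimal LiteLLM proxy sketch (model names and URLs are placeholders)
model_list:
  # The third-party model your team uses today
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  # The self-hosted NIM server you are about to roll out (OpenAI-compatible API)
  - model_name: llama-3.1-8b-self-hosted
    litellm_params:
      model: openai/meta/llama-3.1-8b-instruct
      api_base: http://<GPU_INSTANCE_PUBLIC_IP>:8000/v1
      api_key: not-needed   # the self-hosted endpoint does not check it unless you add auth

You then start the gateway with litellm --config config.yaml, and your AI engineers can switch between providers by changing nothing but the model name in their requests.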
Step 1. Rent (or buy) a GPU instance
First of all, you need to either buy or rent a GPU instance.
Tip 💡
Before investing too much money up front, it is best to rent a lower-end NVIDIA GPU instance, like an A4000, and validate your end-to-end deployment workflow and the cost impact of this new setup.
Once you are confident it works, you can upgrade to a higher-end GPU instance, like an H100.
You can rent a GPU instance from DigitalOcean, Paperspace, AWS, GCP, etc.
Step 2. Install CUDA drivers and the NVIDIA container runtime
We will be running our LLM server as a Docker container inside this GPU instance.
Which means you need to have:
A container runtime, for example Docker or Podman, or even better, containerd.
CUDA drivers, the low-level system software from NVIDIA that lets your operating system (e.g. Ubuntu) and applications communicate with NVIDIA GPUs.
The NVIDIA container runtime, which acts as a wrapper around the default container runtime (like runc) and automatically injects GPU libraries and devices into containers when the --gpus flag is used (you will see that in a second).
You can install these tools manually, or even better, using an Ansible playbook.
Here is an example fragment of an Ansible playbook that installs the NVIDIA tooling on an Ubuntu instance.
- name: Update apt and install required nvidia packages
  apt:
    pkg:
      - nvidia-container-toolkit
      - nvidia-container-runtime
      - cuda-drivers-fabricmanager-535
      - nvidia-headless-535-server
      - nvidia-utils-535-server
    state: latest
    update_cache: true
  tags:
    - apt
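After the playbook runs, you typically still need to tell Docker to use the NVIDIA runtime, and it is worth sanity-checking the whole chain. A minimal sketch (the CUDA image tag below is just an example):

# Point Docker at the NVIDIA runtime and restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Check the drivers see the GPU from the host...
nvidia-smi

# ...and from inside a container (pick any recent CUDA base image)
docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi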
Step 3. Choose the LLM serving tool
There are several open-source LLM serving tools out there, including:
Text Generation Inference (TGI) by Hugging Face. This is the tool that Hugging Face as a company uses to serve LLM predictions.
vLLM, originally developed at UC Berkeley and currently developed and maintained by an open-source community spanning academia and industry (see the one-line launch sketch after this list).
An NVIDIA NIM container for LLM inference (obviously by NVIDIA). NVIDIA NIM is a collection of optimised Docker containers for running inference on NVIDIA-accelerated infrastructure, exposing an OpenAI compatible API that you can easily integrate with.
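For reference, vLLM also exposes an OpenAI-compatible server out of the box. A minimal launch sketch, assuming you have access to the model weights on Hugging Face (double-check the exact model id):

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000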
Marius Rugan, the infra guru in the Community, suggested using NVIDIA NIM. And this is what we did.
Let's now see how you can spin up an LLM server using an NVIDIA NIM container.
Step 4. Pull and run a NIM container
For example, let's see how to pull and run the Meta Llama 3.1 8B Instruct model.
First you need to log in to the NVIDIA container registry.
$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Then you need to set your API key as an environment variable, and create a directory on the GPU instance to store the cached models. It is important to enable caching; otherwise you might end up downloading the same model over and over again. And you know LLMs are big.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
Then, you can pull and run the model.
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
Observe how we:
pass the `--gpus all` flag, so the container can talk to the GPU.
port-forward the container's HTTP API to the host machine, so we can serve predictions over the network.
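Before sending chat requests, you can check that the server is up and the model is loaded by hitting the standard OpenAI-compatible /v1/models endpoint (here from the GPU instance itself):

curl http://localhost:8000/v1/models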
If you have a public IP address (which you usually get by default when you rent a GPU instance), you can now send requests to the LLM server as follows:
curl -X 'POST' \
'http://<GPU_INSTANCE_PUBLIC_IP>:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role":"user", "content":"What is the capital of Mars?"}],
"max_tokens": 64
}'
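And because the API is OpenAI-compatible, your application code can talk to it with the official openai Python client by pointing base_url at your instance. A minimal sketch (the api_key is a placeholder; the self-hosted endpoint does not validate it unless you put an auth layer in front):

from openai import OpenAI

# Point the official OpenAI client at the self-hosted NIM server.
client = OpenAI(
    base_url="http://<GPU_INSTANCE_PUBLIC_IP>:8000/v1",
    api_key="not-needed",  # placeholder: no auth on the bare server
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is the capital of Mars?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)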
BOOM!
If it works, feel free to rent a more powerful instance, and burn the GPU.
IMPORTANT ⚠️
If you are renting a GPU and you are done, you should delete/destroy the machine. Stopping the machine is not enough, and you will continue to be billed.
That’s a wrap → 🌯
Data privacy in the age of LLMs is a big concern for many companies. However, what we discussed today covers only the "inference" side of things.
There is a more fundamental data privacy concern that comes up earlier, when training LLMs or any other ML model. Shihab brought this up in Monday's Coffee, and the discussion inevitably led us to Federated Learning.
But this is a topic for another Monday, and another newsletter.
Wanna join the Real World ML Community?
If you want to level up your ML/AI/LLM game, and learn from curious engineering minds, enrol in one of my courses and get lifetime access to the Real World ML Community.
The Community is the place where I spend most of my time these days.
Because it is not about social-media-marketingish-robotic-content-and-interactions. It is about humans learning from and talking to other humans about how to build machines that can help humans.
See you on the other side.
Pau
Great post, very practical! I have a question: why did you choose NVIDIA NIM over vLLM? From what I understand, vLLM stands out from other inference frameworks, especially thanks to its PagedAttention feature, which makes it very efficient at handling many concurrent requests in production.