Great post, very practical! I have a question: why did you choose NVIDIA NIM over vLLM? From what I understand, vLLM stands out from other inference frameworks, especially thanks to its PagedAttention feature, which makes it very efficient at handling many concurrent requests in production.
Good question. We should find (or run) a few tests to benchmark both inference engines. This is something I have been thinking about for weeks but could not find the time for.
I did some research; it turns out NIM is actually better, and it also has a PagedAttention-like mechanism of its own. A benchmark ;): https://www.macnica.co.jp/en/business/semiconductor/articles/nvidia/145990/
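If you want to run your own comparison, it helps that both vLLM and NIM expose an OpenAI-compatible HTTP API, so the same load script works against either server. Here's a minimal sketch of a concurrency benchmark; the base URL, model name, prompt, and concurrency level are placeholders you'd need to adjust for your own deployment:

```python
# Minimal concurrency benchmark against an OpenAI-compatible endpoint.
# Works for both vLLM and NVIDIA NIM, since both serve /v1/chat/completions.
# BASE_URL, MODEL, CONCURRENCY, and PROMPT are placeholders -- adjust them.
import asyncio
import time

import aiohttp

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder
MODEL = "meta/llama-3.1-8b-instruct"                    # placeholder
CONCURRENCY = 32
PROMPT = "Explain PagedAttention in one paragraph."

async def one_request(session: aiohttp.ClientSession) -> float:
    """Send a single chat completion and return its latency in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    async with session.post(BASE_URL, json=payload) as resp:
        resp.raise_for_status()
        await resp.json()
    return time.perf_counter() - start

async def main() -> None:
    # Fire all requests at once to stress the server's request scheduler,
    # which is where PagedAttention-style KV-cache management pays off.
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        latencies = await asyncio.gather(
            *(one_request(session) for _ in range(CONCURRENCY))
        )
        wall = time.perf_counter() - start
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    print(f"{CONCURRENCY} concurrent requests in {wall:.2f}s "
          f"({CONCURRENCY / wall:.1f} req/s), p50 latency {p50:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

Run the same script against each server with identical prompts and concurrency to get a rough apples-to-apples read on throughput and latency; a proper benchmark would also sweep concurrency levels and vary prompt/output lengths.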