3 Comments
VincentGim:

Great post, very practical! I have a question: why did you choose NVIDIA NIM over vLLM? From what I understand, vLLM stands out from other inference frameworks, especially thanks to its PagedAttention feature, which makes it very efficient at handling many concurrent requests in production.

Pau Labarta Bajo:

Good question. We should find (or run) a few tests to benchmark both inference engines. This is something I have been thinking about for weeks, but I could not find the time.
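
For reference, here is a minimal sketch of what such a test could look like, assuming each engine is served behind an OpenAI-compatible endpoint (both vLLM and NVIDIA NIM expose one). The base URL, model name, and prompt below are placeholder assumptions, not values from the post; you would point the script at each server in turn and compare the numbers.

```python
# Minimal concurrent-load sketch: fire N chat-completion requests at an
# OpenAI-compatible endpoint and report wall time and mean latency.
# BASE_URL, MODEL, and PROMPT are assumptions; adjust to your setup.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1"  # assumed local server under test
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model name
CONCURRENCY = 32
PROMPT = "Summarize the benefits of paged attention in one sentence."

async def one_request(client: httpx.AsyncClient) -> float:
    """Send one chat completion and return its latency in seconds."""
    start = time.perf_counter()
    resp = await client.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 64,
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    # Launch all requests at once to stress the scheduler, the part
    # that PagedAttention-style KV-cache management is meant to help.
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        latencies = await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))
        )
        wall = time.perf_counter() - start
    print(f"{CONCURRENCY} concurrent requests in {wall:.2f}s")
    print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

Running the same script against both servers (same model, same hardware) would give a first rough comparison before reaching for a proper harness.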

VincentGim:

I did some research, and it turns out NIM actually performs better and also has a kind of “PagedAttention” of its own. A benchmark ;): https://www.macnica.co.jp/en/business/semiconductor/articles/nvidia/145990/?utm_source=chatgpt.com
