You are the average of the five people you spend the most time with.
This is something I've been thinking about for a while.
And this is the reason why I am spending less time on bot-dominated social media, and more time with humans in my Discord Community, The Real World ML Community.
What is the Real World ML Community?
It is a private Discord community with more than 700 ML-curious engineers. They are all human, because they all happen to be students from one or more of my courses.
And as far as I know, bots do not enrol in production-grade ML courses.
If you want to join, enrol in one of our bootcamps.
Every Monday we spend 60 minutes, all webcams on, discussing one or more engineering topics, and of course we do that while we drink our favorite beverage. Mine is coffee.
Let me share with you what we discussed this Monday.
You can watch the full session video recording here:
1. Object tracking with Moondream, Zenoh and Rerun.io
First up, our infra master chef Marius decided to blow our minds with an object tracking system that's apparently so lightweight it could probably run on a smart toaster.
Marius adapted this GitHub repo that Ben put together → Link to the repo
Thanks Ben and Marius for the gem.
The main building blocks of the system (sketched in code after this list) are:
Zenoh as the pub/sub message broker that pipes the webcam feed to the rest of the object tracking system. Zenoh is a "superfast, zeroconfig real-time data transmission" system. Translation: it moves data around so quickly that your webcam feed probably gets motion sickness.
Moondream as the visual LLM that's only 2B parameters (that's like... really small in AI terms, trust me). This little guy can look at images and answer questions about them, or draw bounding boxes around objects. Basically, it's like having a very smart, very tiny art critic living in your computer.
Rerun.io, to visualize streams of real-time data, like images. It is also the time-traveling debugger of our dreams. It plots high-dimensional data in real-time AND has time-travel capabilities.
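To make the wiring more concrete, here is a minimal Python sketch of the publisher side of such a system: grab webcam frames with OpenCV, push them over Zenoh, and log them to Rerun. This is my own illustration, not the exact code from the repo; the topic key and package choices are assumptions.

```python
# Minimal sketch (assumed wiring, not the repo's exact code):
# webcam -> Zenoh publisher, plus a Rerun log for live visualization.
# Requires: pip install eclipse-zenoh opencv-python rerun-sdk

import cv2
import zenoh
import rerun as rr

TOPIC = "demo/webcam/frames"  # hypothetical key expression

def main() -> None:
    rr.init("webcam_tracking_demo", spawn=True)   # opens the Rerun viewer
    session = zenoh.open(zenoh.Config())          # zero-config Zenoh session
    publisher = session.declare_publisher(TOPIC)

    cap = cv2.VideoCapture(0)                     # default webcam
    try:
        while True:
            ok, frame_bgr = cap.read()
            if not ok:
                break

            # Publish the JPEG-encoded frame so downstream subscribers
            # (e.g. the Moondream object-tracking node) can pick it up.
            encoded, jpeg = cv2.imencode(".jpg", frame_bgr)
            if encoded:
                publisher.put(jpeg.tobytes())

            # Log the same frame to Rerun (which expects RGB).
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            rr.log("camera/image", rr.Image(frame_rgb))
    finally:
        cap.release()
        session.close()

if __name__ == "__main__":
    main()
```

On the other end, a subscriber would decode each frame, ask Moondream for bounding boxes, and log those to Rerun as well.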
The best part? We got 2 super interesting follow-ups (because this is what happens when you bring a bunch of smart and curious people together: great questions pop up).
Cristiano is using similar tech to track gym exercises. Imagine having an AI that can tell you "Hey buddy, you did 10 reps, but 3 of them were... questionable at best." It's like having a personal trainer who never gets tired of judging your form.
Carlo asked the million-dollar question: "Why are visual LLMs like Moondream so tiny compared to those text-hungry monsters that usually need 7B+ parameters to function properly?"
I cannot say for sure why, but my gut feeling is that the quality of its training data is so good that it doesn't need to be as big as the text-hungry monsters trained on super large (and dirty) datasets.
To train a visual LLM that can answer questions about images, you need training tuples of:
user question (text)
answer (text)
image (bytes)
So, you can use an image generator model (for example DALL-E 3) to generate the image given the question and the answer. By adjusting the prompt you can generate enough samples to train a vLLM model, and add diversity to the dataset.
Simple and beautiful.
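To make it more concrete, here is a rough sketch of what that data-generation loop could look like, assuming you call DALL-E 3 through the OpenAI API. The seed question-answer pairs, the prompt template and the file names are purely illustrative.

```python
# Rough sketch: synthetic (question, answer, image) tuples for training a visual LLM.
# Assumes an OpenAI API key is set; seeds and prompts are illustrative only.
# Requires: pip install openai requests

import json
import requests
from openai import OpenAI

client = OpenAI()

# A few hand-written (question, answer) seeds; in practice you would generate
# many more, e.g. with a text LLM, to add diversity to the dataset.
QA_SEEDS = [
    ("How many dogs are in the picture?", "Two dogs."),
    ("What color is the car?", "The car is red."),
]

def generate_sample(question: str, answer: str, idx: int) -> dict:
    # Describe the scene implied by the question/answer pair and
    # let DALL-E 3 render it.
    scene_prompt = (
        f"A photorealistic scene where the answer to '{question}' is '{answer}'."
    )
    response = client.images.generate(
        model="dall-e-3",
        prompt=scene_prompt,
        size="1024x1024",
        n=1,
    )
    image_url = response.data[0].url
    image_path = f"sample_{idx}.png"
    with open(image_path, "wb") as f:
        f.write(requests.get(image_url, timeout=60).content)

    return {"question": question, "answer": answer, "image": image_path}

if __name__ == "__main__":
    with open("vllm_training_tuples.jsonl", "w") as f:
        for i, (q, a) in enumerate(QA_SEEDS):
            f.write(json.dumps(generate_sample(q, a, i)) + "\n")
```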
If you have another explanation, please share it in the comments below 👇
After this excursion into the world of visual LLMs, we moved on to the next topic: RAG debugging.
2. Debugging a RAG system
Rohit brought us a problem he is facing while building a RAG app for legal documents. In a nutshell, the retrieval is not working as expected.
And the question is, how to debug it?
Retrieval is just the last step of a potentially buggy pipeline that includes:
(maybe) Document chunking
(maybe) Document embedding
Document retrieval
My recommendation is to build a golden dataset of question-answer pairs (a few tens of them will do) and evaluate these 3 steps end-to-end.
To build this dataset, you can manually inspect the legal documents and write questions whose exact answers you can find in the documents.
Doing a bit of good old-fashioned error analysis on the traces you send to Opik or LangSmith will help you reveal the root cause of the problem.
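If it helps, here is a minimal sketch of that end-to-end check: a handful of hand-labelled question → expected-chunk pairs, and a hit-rate@k metric computed over whatever retriever you are debugging. The `retrieve` function and the chunk ids are placeholders for your own pipeline, not part of Rohit's actual app.

```python
# Minimal sketch: evaluating retrieval against a golden dataset.
# `retrieve` is a placeholder for your own chunking + embedding + search pipeline.

from typing import Callable

# A few tens of hand-labelled pairs: question -> id of the chunk that
# contains the exact answer (ids here are made up for illustration).
GOLDEN_SET = [
    {"question": "What is the termination notice period?", "expected_chunk_id": "contract_17_chunk_04"},
    {"question": "Who bears the arbitration costs?", "expected_chunk_id": "contract_02_chunk_11"},
    # ... a few tens of these will do
]

def hit_rate_at_k(retrieve: Callable[[str, int], list[str]], k: int = 5) -> float:
    """Fraction of golden questions whose expected chunk appears in the top-k results."""
    hits = 0
    for example in GOLDEN_SET:
        retrieved_ids = retrieve(example["question"], k)
        if example["expected_chunk_id"] in retrieved_ids:
            hits += 1
        else:
            # These misses are exactly the traces worth inspecting in Opik/LangSmith.
            print(f"MISS: {example['question']!r} -> got {retrieved_ids}")
    return hits / len(GOLDEN_SET)

# Usage: score = hit_rate_at_k(my_retriever)  # then change one pipeline step at a time
```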
3. The Kubernetes finale
There is no Monday Coffee without our weekly dose of Kubernetes. And last Monday was no exception.
This time, Marius brought us Talos Linux.
What is Talos?
Talos is a modern, open-source operating system for running Kubernetes. It is:
Safer → Talos reduces your attack surface: It's minimal, hardened, and immutable. All API access is secured with mutual TLS (mTLS) authentication.
Easier to manage → Talos eliminates configuration drift, reduces unknown factors by employing immutable infrastructure ideology, and delivers atomic updates.
than doing the usual Linux OS install → SSH → Ansible → …
4. The Takeaway
Every Monday reminds me why I love hanging out with engineers: they get genuinely excited about 2-billion parameter models being "tiny" and can have passionate discussions about data transmission protocols.
Plus, where else can you learn about time-traveling debuggers and AI gym trainers in the same conversation?
Now, if you’ll excuse me
I will continue working on the LLMOps bootcamp that Marius and I are preparing for the first week of October, in which we will teach how to develop, deploy and operate agentic software in Kubernetes.
In the first cohort we will build an agentic trading platform, and the first draft of it looks like this:
Wanna join?
If you want to join the bootcamp, and get lifetime access to all future cohorts (which means you pay once and attend as many cohorts as you want, without paying again) you can still grab the massive 40% discount we have until the end of the month (which means less than 3 days).
Talk to you next week,
Pau