How to get JSON out of a Visual Language Model?
Outlines is your new best friend
Two weeks ago we started building a basic image classification system using a super small and light Visual Language Model.
You missed that? No problem, here’s the blog ↓
We managed to get good results on our evaluation dataset with 100 samples:
Accuracy: 0.99
✅ Evaluation completed successfully

99% accuracy meant only 1 sample was misclassified.
However, that one sample was not an ordinary misclassification error, where the model simply picks the wrong class (e.g. answers "cat" when the image shows a dog). It was way worse than that.
The model “invented” a new class “pug” which we never asked for.
Today I want to show you how to fix this error by “forcing” the Language Model to output data with the right format. This is what structured generation is all about.
Let’s start!
What’s the problem?
Language Models are not logic machines. They are just next-token predictors.
They generate text by sampling from probability distributions, not by following strict logic. So even when your prompt explicitly instructs the model to choose
A → “cat” or
B → “dog”
the model might generate
C → “pug”
if those tokens seem plausible in context.
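To make this concrete, here's a toy sketch of next-token sampling. The numbers are made up (a real model assigns probabilities to tens of thousands of tokens), but the mechanism is the same:

```python
import random

# Made-up probabilities for the next token after: '{"pred_class": "'
# A real model computes these with a softmax over its whole vocabulary.
next_token_probs = {
    "dog": 0.55,
    "cat": 0.40,
    "pug": 0.05,  # plausible in context, so it still gets probability mass
}

tokens = list(next_token_probs.keys())
weights = list(next_token_probs.values())
sampled = random.choices(tokens, weights=weights, k=1)[0]
print(sampled)  # roughly 1 run in 20 prints "pug"
```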
So the question is
How can we “force” the model to output a specific format?
This is where structured generation comes to the rescue.
Let me show you.
What is structured generation?
Structured generation is a technique that allows us to “force” the Language Model to output a specific format, like JSON.
Remember, Language Models generate text by sampling one token at a time. At each step of the decoding process, the model generates a probability distribution over the next token and samples one token from it.
Structured generation techniques “intervene” at each step of the decoding process by masking out tokens that are incompatible with the structured output we want to generate.
For example, in our case we want the model to output either this
```json
{"pred_class": "dog"}
```

or this

```json
{"pred_class": "cat"}
```

If we use structured generation, the decoding process will be:
Deterministic up to the 5th token.
Non-deterministic at token 6, where the model needs to choose between “dog” and “cat”.
Deterministic at token 7, where the model is forced to output the “}” token.
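Here's a toy sketch of that masking loop. It works character by character for simplicity (real libraries like Outlines work on tokens and compile the schema into an efficient mask), and the model's probabilities are faked, but the "mask, then sample" mechanic is the core idea:

```python
import random

ALLOWED = ['{"pred_class": "dog"}', '{"pred_class": "cat"}']

def allowed_next_chars(prefix: str) -> set[str]:
    """The 'grammar': which characters may legally follow the current prefix?"""
    return {
        target[len(prefix)]
        for target in ALLOWED
        if target.startswith(prefix) and len(target) > len(prefix)
    }

def fake_model_probs() -> dict[str, float]:
    """Stand-in for the model's next-token distribution (uniform, for the toy)."""
    return {ch: 1.0 for ch in 'abcdefghijklmnopqrstuvwxyz{}":_ '}

prefix = ""
while prefix not in ALLOWED:
    probs = fake_model_probs()
    # The structured-generation step: mask out every forbidden character...
    masked = {ch: p for ch, p in probs.items() if ch in allowed_next_chars(prefix)}
    # ...then sample only among what survives the mask.
    chars = list(masked.keys())
    weights = list(masked.values())
    prefix += random.choices(chars, weights=weights, k=1)[0]

print(prefix)  # always one of the two allowed JSON strings, never "pug"
```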
Ok, enough talking. Let’s see how to add structured generation in our example.
Hands-on example
👉 All the code I am showing is available in this repository. I would appreciate a star ⭐ on GitHub if you get value from it 😉
We will be using Outlines, an open-source library for structured generation developed by dottxt.ai, which is super easy to use and just *works*.
These are the main steps:
1. Load the VLM (LFM2-VL-1.6B in our case) and its pre-processor.
```python
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
)

model_id = "LiquidAI/LFM2-VL-1.6B"

print("📚 Loading processor...")
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    max_image_tokens=256,
)

print("🧠 Loading model...")
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    trust_remote_code=True,
    device_map="auto",
)
```
2. Define the output schema using Pydantic.

```python
from typing import Literal

from pydantic import BaseModel

class CatsVsDogsClassificationOutputType(BaseModel):
    pred_class: Literal["cat", "dog"]
```
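As a quick sanity check (this is plain Pydantic, nothing Outlines-specific yet), the schema already rejects anything that isn't "cat" or "dog":

```python
from pydantic import ValidationError

ok = CatsVsDogsClassificationOutputType.model_validate_json('{"pred_class": "dog"}')
print(ok)  # pred_class='dog'

try:
    CatsVsDogsClassificationOutputType.model_validate_json('{"pred_class": "pug"}')
except ValidationError:
    print("'pug' rejected, as expected")
```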
3. Create an Outlines model from the original model and the pre-processor, plus a generator that uses this model and the output_schema you defined.

```python
import outlines

# The Pydantic class we just defined
output_schema = CatsVsDogsClassificationOutputType

# Wrap the Hugging Face model and processor in an Outlines model
model = outlines.from_transformers(model, processor)
output_generator = outlines.Generator(model, output_schema)
```
4. Generate the prompt from the messages history (without the image).

```python
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": system_prompt}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {"type": "image", "image": ""},
        ],
    },
]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
5. Pass the image and generate the response.

```python
response: str = output_generator({"text": prompt, "images": image})
```
6. Parse the response string into the object type you defined with Pydantic.

```python
try:
    # Parse the response into the structured output type
    response = output_schema.model_validate_json(response)
    return response
except Exception as e:
    print("Error generating structured output: ", e)
    print("Raw model output: ", response)
    return None
```
All these steps are implemented in the get_structured_model_output() function in inference.py.
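If you just want the shape of that function, here is a condensed sketch (not a copy of the repo's code; it assumes the processor and output_generator from the steps above are in scope):

```python
def get_structured_model_output(image, system_prompt: str, user_prompt: str):
    """Classify `image` and return a validated Pydantic object, or None on failure."""
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image", "image": ""},
            ],
        },
    ]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    response: str = output_generator({"text": prompt, "images": image})
    try:
        return CatsVsDogsClassificationOutputType.model_validate_json(response)
    except Exception as e:
        print("Error generating structured output: ", e)
        print("Raw model output: ", response)
        return None
```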
To see structured generation in action, go to the command line and run:
```
make evaluate CONFIG_FILE=cats_vs_dogs_v2.yaml
```

and you will get the following output:
Accuracy: 0.98
✅ Evaluation completed successfully

Which is both good news and bad news.
The good news is that the model does not hallucinate “pug” labels anymore. If you re-run the notebook you will find the 2 misclassified examples.
The bad news is that the model is still making mistakes.
And look, I think we should be doing better than that.
So next week I will show you how to fine-tune the model, so we can squeeze all the juice out of it and get better results.
Wanna learn Real World AI engineering with me?
Last week I started working as a Dev Rel Engineer at Liquid AI.
My mission is super simple:
Help you build AI systems that **work**
I am preparing the first Ask Me Anything session, where I plan to answer as many questions as possible from you.
I want to hear what your frustrations are, and see how I can help you by building hands-on content around the topics you care about.
Join the Liquid AI Discord Community so you don't miss any events.
Enjoy the weekend,
Say “I love you” to whatever person you happen to love,
Talk to you next week,
Pau