How do LLMs learn to reason?
These days I am at the seaside, teaching my 4-year-old son Kai and my 5-year-old cousin Sofia the four basic arithmetic operations: addition, subtraction, multiplication and division.
And the thing is, I want them to learn to think, not just memorise a table of results.
Every day I ask them a question like this:
The problem
“Imagine, Kai and Sofia, that you are both on the shore of the beach. Sofia decides she will swim, while Kai takes a longboard and paddles instead, going in the same direction. If Sofia swims at 2 km per hour and Kai paddles at 4 km per hour, how far apart will you be after 2 hours?”
This is not an easy question for a 4-year-old.
It is a multi-step journey from a set of premises to the correct output, one that requires some modelling and basic arithmetic.
Problems like these (and way harder ones, of course) are what AI researchers today call reasoning problems.
And it turns out LLMs are not that bad at generating (some sort of) reasoning that has real-world business impact. If you train them well.
In this post, I want to summarise some of the things I have been learning these days, with my kids and from a few top tech blogs I have found on the internet, that can help you better understand how LLMs (and maybe even human kids) can learn to reason.
How do humans learn to solve complex problems?
We humans are animals who can reason.
According to Socrates (one of the most influential philosophers in history), reasoning is innate in human beings. However, the skill of using that reasoning effectively depends on how much you put it to work.
How?
All learning requires a teacher: someone (sometimes even yourself) whose role isn't to transfer knowledge, but to act as a “midwife”, helping the student develop the patterns of reasoning and understanding that the human brain is ready for.
For example, when I ask Kai and Sofia the challenge above, I am not as interested in the final output as I am in the thought process they follow to find their answer.
In particular, I am interested in seeing whether they can break the problem down into 3 steps:
Step 1
“First we will see how far Sofia gets in 2 hours. As she swims at 2 km per hour, after 1 hour she will swim 2 km. After 2 hours she will swim another 2 km. That is 2 plus 2, and that is… 4.
Step 2
“On the other hand, Kai paddles at 4 km per hour. This means that in one hour he will paddle 4 km. If he paddles for 2 hours, he will move 4 more km. That is a total of 4 plus 4, which equals… 8.
Step 3
“Finally, Sofia has moved 4 km and Kai 8 km, in the same direction. The distance between them is the gap between 4 and 8. That is the subtraction 8 minus 4, and that is… 4.”
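For the grown-ups following along, here is the same 3-step solution written as a tiny Python sketch (the variable names are mine):

```python
# The same 3-step reasoning, as code.
hours = 2

sofia_distance = 2 * hours  # Step 1: Sofia swims at 2 km/h -> 4 km
kai_distance = 4 * hours    # Step 2: Kai paddles at 4 km/h -> 8 km

gap = kai_distance - sofia_distance  # Step 3: 8 - 4 = 4 km
print(f"They are {gap} km apart")    # They are 4 km apart
```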
EUREKA!
Whenever they shout out their answer, for example “FOOOUR!”, my next question is: WHY?
Sometimes they get the answer right, but their reasoning is complete nonsense. As a teacher, my role here is to steer them back to the 3-step process, see where things went wrong, and continue from there.
Other times, they get the reasoning right, but the arithmetic part is wrong.
As a teacher, I think the latter is better than the former. This is where I see learning happening.
Because learning to reason is a skill that transfers from problem to problem. This is why LLMs that can mimic this capability have a vast range of applications. They are not just memorisers; they are reasoning pattern matchers that can be used in many scenarios, including planning in multi-step agentic workflows.
Now, this paradigm of learning works for one-on-one human teaching. But the question you might have as an AI engineer/researcher is:
How can you teach a Language Model to reason?
How do LLMs learn to reason?
Large base LLMs from 2 or 3 years ago already showed some reasoning capabilities, even though they were not explicitly trained for reasoning tasks.
One of the most popular ways to elicit them, used by AI practitioners like you and me every day, is Chain of Thought prompting. The idea is to add intermediate reasoning steps to the prompt that act as milestones, keeping the LLM on the right track towards the solution.
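To make this concrete, here is a minimal sketch of a few-shot Chain of Thought prompt (the wording and the worked example are my own choices; you would send this string to whichever LLM you use):

```python
# A few-shot Chain of Thought prompt: we show the model one worked
# example with explicit intermediate steps before asking a new question.
cot_prompt = """\
Q: Sofia swims at 2 km/h and Kai paddles at 4 km/h in the same
direction. How far apart are they after 2 hours?
A: Sofia covers 2 * 2 = 4 km. Kai covers 4 * 2 = 8 km.
The gap is 8 - 4 = 4 km. The answer is 4 km.

Q: A train travels at 60 km/h for 3 hours. How far does it go?
A: Let's think step by step.
"""
# Send `cot_prompt` to any LLM: the worked example nudges it to write
# out intermediate steps before committing to a final answer.
```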
Brilliant.
Now, once you observe that adding extra tokens to your prompt at inference time helps the model perform better, the next obvious question is:
Is there a way to induce this learning at training time, and specialize the model on reasoning tasks?
This is precisely what AI researchers have been working on intensively for the last year or so, especially since the release of the first open-source reasoning model, DeepSeek-R1.
These days, researchers use 2 methods, combined in different ways, to add extra reasoning capabilities on top of pre-trained Large Language Models. These are often referred to as “post-training” techniques:
1. Supervised fine-tuning on highly curated Chain of Thought datasets. The dataset contains (input, output) pairs, where the output includes both the reasoning and the final answer. This data is high-quality, but creating it is time-consuming and hard to scale (see the example record after this list).
2. Reinforcement learning, using a reward that measures both the accuracy of the final answer and the format of the thought process (see the reward sketch after this list).
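To make method 1 concrete, here is what a single Chain of Thought training record could look like. The <think>/<answer> tags and the schema are illustrative; real datasets use whatever format the training framework expects:

```python
# One record of a Chain of Thought fine-tuning dataset.
cot_record = {
    "input": (
        "Sofia swims at 2 km/h and Kai paddles at 4 km/h in the same "
        "direction. How far apart are they after 2 hours?"
    ),
    "output": (
        "<think>"
        "Sofia: 2 km/h * 2 h = 4 km. "
        "Kai: 4 km/h * 2 h = 8 km. "
        "Gap: 8 km - 4 km = 4 km."
        "</think>"
        "<answer>4 km</answer>"
    ),
}
```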
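And to make method 2 concrete, here is a toy version of the kind of rule-based reward used in reasoning-focused RL pipelines. The tag names and the 0.1 format bonus are my own illustrative choices:

```python
import re

def reward(completion: str, gold_answer: str) -> float:
    """Toy reward for RL on reasoning tasks.

    Combines a small format reward (did the model wrap its thoughts
    and answer in tags?) with a larger accuracy reward (is the final
    answer correct?). The tags and weights here are illustrative.
    """
    score = 0.0

    # Format reward: reasoning inside <think>, answer inside <answer>.
    if re.search(r"<think>.+?</think>\s*<answer>.+?</answer>",
                 completion, re.DOTALL):
        score += 0.1

    # Accuracy reward: extract the answer and compare to the gold one.
    match = re.search(r"<answer>(.+?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == gold_answer.strip():
        score += 1.0

    return score
```

During training, the model generates many candidate completions per problem, and this scalar score is what nudges it towards well-formatted, correct reasoning.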
One of the most remarkable advances in AI this past year is what the DeepSeek team found out:
Applying reinforcement learning alone on top of a large pre-trained LLM (like DeepSeek-V3) is enough for the model to pick up correct thought patterns.
Other teams from top labs like LiquidAI have dug deeper into the effects of supervised fine-tuning and RL in isolation, and found that (obviously :-)) a combination of both is what works best.
As a teacher of AI systems, and of kids, I am fascinated by the process of learning.
So next week, I will show you some practical examples of how LLMs reason, so we can get our hands dirty with code!
(Mine are already dirty with sand!)
Wanna learn to build AI systems with me?
On October 7th I will start the first cohort of my LLMOps bootcamp.
We will go through the science and engineering of building, deploying, monitoring and improving production-ready AI systems that use LLMs.
No BS.
No Marketing.
Only the things that I have learned in the last 10 years building AI systems that work.
Step by step.
Because I don’t want to give you a ready-made repo with all the code I built.
I want you to learn with me how to build systems from scratch.
So you have the tools and skills to build any AI system you want.
Enjoy the weekend,
Pau