From Kindergarten Math to LLM Reasoning: The Beach Challenge Problem
Last week, while my 4-year-old son Kai was munching on his afternoon snack (nectarines, as usual), I decided to squeeze in a little educational moment.
I pitched him what I thought was a fun little problem:
The Beach Challenge Problem
Imagine, Kai, that you and Sofia are both on the shore of the beach. Sofia decides she will swim, while you take a longboard and paddle in the same direction. If Sofia swims at 2 km per hour and you paddle at 4 km per hour, how far apart will you be in 2 hours?

Kai's response? "Daddy, can we just go to the actual beach instead?"
Fair point, kid.
Fair point.
Back to School, Back to AI
This week, Kai's back in kindergarten (thank goodness for structured learning environments), which means I needed to find someone else to teach.
Naturally, I turned to my other "students" → Large Language Models.
After all, if they can't handle a simple beach problem, how can we trust them with anything more complex?
The problem I gave Kai was basically addition with extra steps. Not exactly going to challenge GPT-4's "reasoning" capabilities.
So I made it slightly more complicated.
The (Slightly More Complicated) Beach Challenge Problem
Here's the updated version that would probably make Kai run away screaming (and honestly, might make some of us do the same):
Kai and Sofia start at the same point on a beach. Sofia decides to swim directly toward a buoy that's 7 km offshore at a 41° angle from the shoreline. She swims at 2 km/hour, but ocean currents push her sideways at 2 km/hour perpendicular to her intended direction. Meanwhile, Kai takes his longboard and paddles along the shoreline at 5 km/hour for the first hour. After exactly 1 hour, he turns and paddles directly toward Sofia's current position at 1 km/hour (slower because he's now fighting waves). If both continue for a total of 2 hours from the start, what is the distance between them at the end?
Now we're talking!
Why This Problem Matters (Beyond Torturing AIs)
Solving this problem requires both reasoning and math (a Python sketch of the full computation follows this list). Specifically, it demands:
Trigonometry skills to track Sofia's actual path (because ocean currents are jerks like that)
Multi-step calculations for positions at different time intervals
Conditional logic to determine when and where Kai changes direction
Speed management accounting for Kai's change in velocity
Coordinate geometry to find Sofia's position when Kai starts his pursuit
Vector mathematics to calculate final positions using components
Distance formulas to find the final separation
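Before handing this to an LLM, it helps to see that the whole thing boils down to a short, deterministic computation. Here's a minimal Python sketch that walks through exactly those steps. It assumes the shoreline is the x-axis, offshore is the positive y direction, the current pushes Sofia to the left of her heading, and Kai aims at Sofia's position at the 1-hour mark and holds that heading; the repo's generation script is the source of truth for these conventions.

```python
import math

def solve_beach_problem(
    sofia_speed: float = 2.0,      # km/h, Sofia's swimming speed toward the buoy
    current_speed: float = 2.0,    # km/h, sideways drift perpendicular to her heading
    angle_deg: float = 41.0,       # buoy bearing measured from the shoreline
    kai_shore_speed: float = 5.0,  # km/h, Kai's speed along the shoreline (first hour)
    kai_wave_speed: float = 1.0,   # km/h, Kai's speed after turning toward Sofia
    total_hours: float = 2.0,
) -> float:
    theta = math.radians(angle_deg)
    # Sofia's velocity = intended heading plus the (assumed) perpendicular drift
    vx = sofia_speed * math.cos(theta) - current_speed * math.sin(theta)
    vy = sofia_speed * math.sin(theta) + current_speed * math.cos(theta)

    def sofia_at(t: float) -> tuple[float, float]:
        return vx * t, vy * t

    # Kai: one hour along the shoreline, then straight toward Sofia's t=1h position
    kai_x, kai_y = kai_shore_speed * 1.0, 0.0
    sx1, sy1 = sofia_at(1.0)
    dx, dy = sx1 - kai_x, sy1 - kai_y
    dist = math.hypot(dx, dy)
    remaining = total_hours - 1.0
    kai_x += kai_wave_speed * remaining * dx / dist
    kai_y += kai_wave_speed * remaining * dy / dist

    # Final separation between Sofia's and Kai's positions at the 2-hour mark
    sx2, sy2 = sofia_at(total_hours)
    return math.hypot(sx2 - kai_x, sy2 - kai_y)

print(f"Distance after 2 hours: {solve_beach_problem():.2f} km")
```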
The Real Challenge: Building Cost-Effective AI Solutions
Here's where it gets interesting for us ML engineers. Sure, we could throw this problem at GPT-4 and probably get a decent answer (for a price).
But what if you need to solve thousands of these problems?
What if you're building a tutoring system, or a mathematical reasoning benchmark, or you just really, really love geometry problems?
This is the first part of what will be an N-part series (because apparently I can't just write one blog post like a normal person) where I'll show you how to build a cost-effective and accurate system to tackle problems exactly like this one.
All the code, experiments, and probably my tears will be available in this GitHub repository as we go through this journey together.
Steps to build an optimal solution
1. Set up the tools
I used (you won't believe it) uv to manage all the Python project dependencies.
To get up and running, make sure you have `uv` installed on your system:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

If you go to the dependencies section of the pyproject.toml file, you will see something like this:
```toml
[project]
dependencies = [
    # ...
    "baml-py==0.202.1",
    "opik>=1.8.17",
    # ...
]
```

OK, there are probably more dependencies, but these two are fundamental for the project:
baml-py is a token-efficient way to get structured output from your LLM, which is essential if you want to build a reliable system that calls LLMs and consumes their output.
opik is an open-source LLM evaluation tool that makes it quite easy to manage evaluation datasets, experiments and LLM application tracing.
2. Generate the evaluation dataset
Before diving into a potentially never-ending rabbit hole of agent workflows and blah blah blah, you need to establish criteria to measure how good each solution you design is.
And for that, you need to generate a dataset of problems and solutions.
In this case, the problem can be solved exactly with a simple Python function that encapsulates the steps mapping the initial problem quantities (e.g. Sofia's speed, Kai's speed, etc.) to the final solution (i.e. the distance between them).
You can find it in the scripts/generate_evaluation_dataset.py file.
To generate the dataset, run:
```bash
uv run python scripts/generate_evaluation_dataset.py \
    --n_problems 10 \
    --dataset_name beach_challenge_problem_dataset
```

This will generate a dataset of 10 problems and solutions and push it to the Opik evaluation platform.
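If you're curious what the push itself looks like, here's a hedged sketch using the Opik Python SDK. It reuses the solve_beach_problem function sketched earlier, and the item keys (`input`, `expected_output`) are my own naming, not necessarily what the repo's script uses:

```python
from opik import Opik

client = Opik()
dataset = client.get_or_create_dataset(name="beach_challenge_problem_dataset")

# Each item pairs a templated problem statement with the exact answer computed by
# the deterministic solver sketched earlier; the real script randomizes the
# speeds, angles, and durations for every problem it generates.
problem_text = "Kai and Sofia start at the same point on a beach. ..."
dataset.insert([
    {"input": problem_text, "expected_output": solve_beach_problem()},
])
```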
3. Build a strong baseline
Strong generalist models like Claude Sonnet 4, GPT-5 and others are useful for building quick baseline solutions without having to think too hard about the problem.
In this case, the task does not depend on external data. It is all about reasoning and doing arithmetic right.
In the solve_problem.baml file you can find the BAML function that uses Claude Sonnet 4 to extract a numeric answer and the underlying reasoning the model followed.
You can find the complete agent implementation in one_shoot_agent.py.
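To give you a feel for the Python side of that, here's a hedged sketch of how the generated BAML client could be called. The function name SolveProblem and the reasoning/answer fields are my assumptions for illustration; the repo's solve_problem.baml defines the real names:

```python
# Sketch only: assumes a BAML function named SolveProblem whose return type has
# `reasoning` and `answer` fields; run `uv run baml-cli generate` to produce the
# baml_client package from the .baml files.
from baml_client import b

class OneShotAgent:
    """Single LLM call: problem text in, structured reasoning + numeric answer out."""

    def solve(self, problem_text: str) -> float:
        result = b.SolveProblem(problem_text)
        print(result.reasoning)   # the model's step-by-step working
        return result.answer      # the numeric answer we'll score against the ground truth
```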
4. Evaluate the baseline solution
To evaluate the baseline, I added an `evaluate` method to the `GenericAgent` class that uses the `opik` library to run the model against the dataset.
I implemented two error metrics to evaluate the model:
RelativeErrorMetric, which measures the relative error between the model's answer and the correct answer; the higher the error, the worse the model. A minimal sketch of this metric follows the list.
WithinBoundsMetric, which scores whether the model's answer falls within an acceptable tolerance of the correct answer; the higher the score, the better the model.
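Here's a minimal sketch of what the relative-error metric could look like with Opik's custom-metric API; the exact scoring logic and argument names in the repo may differ:

```python
from opik.evaluation.metrics import base_metric, score_result

class RelativeErrorMetric(base_metric.BaseMetric):
    """|model answer - ground truth| / |ground truth|; lower is better."""

    def __init__(self, name: str = "relative_error"):
        super().__init__(name=name)

    def score(self, output: float, expected_output: float, **kwargs) -> score_result.ScoreResult:
        # Guard against division by zero with a tiny floor on the denominator
        error = abs(output - expected_output) / max(abs(expected_output), 1e-9)
        return score_result.ScoreResult(name=self.name, value=error)
```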
To run the evaluation, I added a script that you can run either on the entire dataset:

```bash
uv run python scripts/evaluate_agent.py \
    --model anthropic/claude-sonnet-4-20250514 \
    --dataset beach_challenge_problem_dataset
```

or on a specific item of the dataset:

```bash
uv run python scripts/evaluate_agent.py \
    --model anthropic/claude-sonnet-4-20250514 \
    --dataset beach_challenge_problem_dataset \
    --item_ids 01988960-c83a-75a3-b89b-6ba9a1b4fdf7
```

The evaluation results are saved as an experiment run in the Opik platform.
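Under the hood, the `evaluate` method is essentially a thin wrapper around Opik's evaluation loop. Here's a hedged sketch of the wiring, assuming the agent exposes a solve() method like the one-shot sketch above and that the dataset items use the input/expected_output keys from the generation sketch; the repo's actual task and key names may differ:

```python
from opik import Opik
from opik.evaluation import evaluate

agent = OneShotAgent()  # from the sketch in step 3

def task(item: dict) -> dict:
    # One LLM call per dataset item; the returned dict is combined with the
    # item's fields and passed to each metric's score() method.
    return {"output": agent.solve(item["input"])}

client = Opik()
dataset = client.get_or_create_dataset(name="beach_challenge_problem_dataset")

evaluate(
    dataset=dataset,
    task=task,
    scoring_metrics=[RelativeErrorMetric()],  # from the sketch above
    experiment_name="claude-sonnet-4-baseline",
)
```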
When I run it on my end, I get something like 90% accuracy as measured by the WithinBoundsMetric.
And that is not bad, but we can do better.
Especially if you (like me) care about cost-effectiveness.
5. Are we done?
No, we are not, unless you like to burn cash and time.
I mean, your users will not be happy with a solution that takes 10 seconds to run.
And you won't be happy burning cash to deliver that.
So next week, we will try to build something better!
Ready to dive deeper into LLMOps? If building production-ready AI systems that can handle complex reasoning tasks sounds like your cup of coffee (or beach-side coconut water), check out our comprehensive LLMOps Bootcamp.
Marius Rugan and a guy called Pau will teach you everything about designing, developing, and building LLM-powered systems on Kubernetes.
Yes, production systems. Not toys.
Because this is what companies need (and are desperately looking for!)
Talk to you next week,
Peace and Love
Pau



