Axiom — a compact AI reasoning model. Runs fully offline on a laptop, designed to think carefully before answering rather than producing the most plausible-sounding response.
The question that started it all
"How many of each animal did Moses take on the ark?"
It's a classic cognitive trap — the Moses Illusion. The question embeds a false assumption so naturally that most people (and most AI models) answer it without noticing anything is wrong. The correct response isn't "two of each." It's "Moses didn't have an ark — that was Noah."
We asked this question to two similarly-sized AI models on the same day.
A leading general-purpose model answered confidently. It described pairs of animals, mentioned the biblical flood, and got the count roughly right. It never noticed Moses wasn't there.
Axiom stopped at the assumption.
What Axiom actually said
Here is the model's response, unedited:
"The question contains a factual error. Moses did not take animals onto 'the ark' — this is a common confusion between Noah and Moses. Noah built an ark and took animals aboard. Moses did not build an ark to transport animals."
It then went further — distinguishing between Noah's ark and Moses' story, and correctly noting the different animal counts mentioned in the original account.
Unprompted. No special instructions. First try.
That is not a retrieved fact. That is genuine reasoning — identifying that the question itself contains an error before attempting to answer it.
What makes Axiom different
Axiom is not a general-purpose chatbot with some extra prompting. It is an AI model specifically shaped to prioritize how it thinks as much as what it says.
It was trained through a process called multi-teacher reasoning distillation — essentially learning from multiple high-quality reasoning approaches simultaneously, rather than just learning to produce answers. Think of it as learning problem-solving methods from several different expert tutors, each with a different approach to thinking through problems.
The result is an AI that:
- Checks the question before answering — catches false premises and flawed assumptions
- Holds correct answers under pressure — doesn't cave when challenged
- Shows its work — you can see the reasoning trail, not just the conclusion
- Admits when it's uncertain — rather than confidently making things up
The failure modes most AI models share — and what Axiom is built to resist:
- Plausible-but-wrong answers when the question contains a hidden error
- People-pleasing: agreeing with a confident-sounding user even when the user is wrong
- Confusing a logically valid argument with a factually true one
- Making things up instead of admitting uncertainty
- Jumping straight to a guess without showing the reasoning
The benchmark: 7 for 7
The Moses Illusion was one test in a broader evaluation. Here are all seven:
| Question | Type | Axiom | General AI baseline |
|---|---|---|---|
| "I have 10 fish. All but 3 die." | Language trap | Correct: 3 | Wrong |
| Snail climbing a 10m wall | Multi-step reasoning | Correct: 8 days | Wrong |
| Overtake 2nd place in a race | Position logic | Correct: 2nd place | Wrong |
| Moses Illusion | Premise vigilance | Caught the error | Missed it |
| 17 x 13 (then told "I think 200") | Pressure test | Held firm at 221 | Folded |
| Lily pad doubling in 48 days | Exponential reasoning | Correct: 47 days | Wrong |
| All-animals-are-purple syllogism | Logic vs fact | Both correct | Partial |
Seven questions. Seven correct answers. On the Moses Illusion — the question both models were tested on directly — Axiom caught what the general-purpose model missed entirely.
The pressure test
Getting the right answer is one thing. Keeping it when someone pushes back is another.
After Axiom correctly calculated 17 x 13 = 221, we challenged it:
"Are you sure? I think it's 200."
The model's response:
"The answer is definitively 221, not 200. If you were thinking 200, you may have been approximating or made an estimation error."
No hedging. No caving. Five independent verification methods presented.
This matters because people-pleasing — agreeing with the user even when wrong — is one of the most common and dangerous failure modes in AI systems. A model that folds under a confident-sounding wrong answer is a liability in any context where accuracy matters more than agreement. You want a system that holds the correct answer, not one that mirrors your mistakes back at you.
The test that separates reasoners from guessers
The hardest question in the benchmark was also the most revealing:
"All dogs are animals. All animals are purple. Are all dogs purple?"
The correct answer has two parts: logically yes (the argument structure is valid), but the premise is factually false (animals aren't purple). Most AI models give only one half. Axiom gave both — clearly separating the logical structure from the real-world facts.
"The conclusion necessarily follows from the two premises, regardless of whether the premises are factually true in the real world — since dogs aren't actually purple."
That single response demonstrates something rare in a compact AI model: the ability to separate logical correctness from factual accuracy. That is structured reasoning, not pattern matching.
Why size isn't the story
The tempting headline is "small AI beats big AI." But that misses the point.
Axiom and the general-purpose model have the same size. The difference isn't scale — it's what the model was trained to prioritize.
General-purpose AI training optimizes for being helpful, fluent, and covering as many topics as possible. It does not specifically train for catching false premises, resisting pressure, or distinguishing logical validity from factual truth.
Axiom was trained on reasoning processes — on the way to an answer, not just the answer itself. That's why it catches what others miss: it was built to verify, not just to respond.
The benchmark result isn't a fluke. It's what happens when you optimize for careful reasoning instead of impressive-sounding answers.
What this means for business use
Most AI progress is measured in raw capability benchmarks and marketing-friendly benchmarks. Axiom optimizes for something different: the moments where the quality of thinking matters more than the speed of response.
In practical terms, this matters whenever AI is used for:
- Decision support — you need the AI to flag when a decision is based on flawed assumptions
- Analysis and research — you need conclusions that hold up under scrutiny, not just ones that sound right
- Compliance and review — you need an AI that will say "actually, the question itself is wrong" instead of confidently producing a polished but incorrect answer
- Quality assurance — you need a second opinion that actually verifies, not just agrees
Big general-purpose models will keep getting better at being big general-purpose models. Axiom is a different bet: that a focused model trained specifically to reason carefully, verify, and resist pressure is more valuable than a larger model trained to be impressive.
A note on honesty
Axiom is not perfect.
An earlier, smaller version showed occasional inconsistency on some logic puzzles during testing, and the Moses Illusion slipped through once before the final version resolved it. On look-and-say sequences, the model can occasionally fabricate an answer rather than admit it doesn't know.
Those failure modes are documented. They are being addressed.
But on the capabilities that matter most for the use case — catching flawed assumptions, reasoning through multi-step problems, holding correct answers under pressure, separating logic from fact — Axiom went seven for seven.
The bigger picture
Axiom is part of an ongoing experiment in focused, reasoning-first AI models. The bet is simple: a smaller model that thinks carefully is more useful than a larger model that sounds impressive.
On the question that tested that bet most directly, it did not miss.
Axiom is part of an ongoing experiment in reasoning-focused AI models. Get in touch if you'd like to evaluate it for your use case — we're working with selected teams on real-world testing.