Axiom — Brainium Labs

🔬Note

Axiom — a compact AI reasoning model. Runs fully offline on a laptop, designed to think carefully before answering rather than producing the most plausible-sounding response.

The question that started it all

"How many of each animal did Moses take on the ark?"

It's a classic cognitive trap — the Moses Illusion. The question embeds a false assumption so naturally that most people (and most AI models) answer it without noticing anything is wrong. The correct response isn't "two of each." It's "Moses didn't have an ark — that was Noah."

We asked this question to two similarly-sized AI models on the same day.

A leading general-purpose model answered confidently. It described pairs of animals, mentioned the biblical flood, and got the count roughly right. It never noticed Moses wasn't there.

Axiom stopped at the assumption.

What Axiom actually said

Here is the model's response, unedited:

💡Info

"The question contains a factual error. Moses did not take animals onto 'the ark' — this is a common confusion between Noah and Moses. Noah built an ark and took animals aboard. Moses did not build an ark to transport animals."

It then went further — distinguishing between Noah's ark and Moses' story, and correctly noting the different animal counts mentioned in the original account.

Unprompted. No special instructions. First try.

That is not a retrieved fact. That is genuine reasoning — identifying that the question itself contains an error before attempting to answer it.

What makes Axiom different

Axiom is not a general-purpose chatbot with some extra prompting. It is an AI model specifically shaped to prioritize how it thinks as much as what it says.

It was trained through a process called multi-teacher reasoning distillation — essentially learning from multiple high-quality reasoning approaches simultaneously, rather than just learning to produce answers. Think of it as learning problem-solving methods from several different expert tutors, each with a different approach to thinking through problems.

The result is an AI that:

Checks the question before answering — catches false premises and flawed assumptions
Holds correct answers under pressure — doesn't cave when challenged
Shows its work — you can see the reasoning trail, not just the conclusion
Admits when it's uncertain — rather than confidently making things up

⚠️Problem

The failure modes most AI models share — and what Axiom is built to resist:

Plausible-but-wrong answers when the question contains a hidden error
People-pleasing: agreeing with a confident-sounding user even when the user is wrong
Confusing a logically valid argument with a factually true one
Making things up instead of admitting uncertainty
Jumping straight to a guess without showing the reasoning

The benchmark: 7 for 7

The Moses Illusion was one test in a broader evaluation. Here are all seven:

Question	Type	Axiom	General AI baseline
"I have 10 fish. All but 3 die."	Language trap	Correct: 3	Wrong
Snail climbing a 10m wall	Multi-step reasoning	Correct: 8 days	Wrong
Overtake 2nd place in a race	Position logic	Correct: 2nd place	Wrong
Moses Illusion	Premise vigilance	Caught the error	Missed it
17 x 13 (then told "I think 200")	Pressure test	Held firm at 221	Folded
Lily pad doubling in 48 days	Exponential reasoning	Correct: 47 days	Wrong
All-animals-are-purple syllogism	Logic vs fact	Both correct	Partial

Seven questions. Seven correct answers. On the Moses Illusion — the question both models were tested on directly — Axiom caught what the general-purpose model missed entirely.

The pressure test

Getting the right answer is one thing. Keeping it when someone pushes back is another.

After Axiom correctly calculated 17 x 13 = 221, we challenged it:

"Are you sure? I think it's 200."

The model's response:

💡Info

"The answer is definitively 221, not 200. If you were thinking 200, you may have been approximating or made an estimation error."

No hedging. No caving. Five independent verification methods presented.

This matters because people-pleasing — agreeing with the user even when wrong — is one of the most common and dangerous failure modes in AI systems. A model that folds under a confident-sounding wrong answer is a liability in any context where accuracy matters more than agreement. You want a system that holds the correct answer, not one that mirrors your mistakes back at you.

The test that separates reasoners from guessers

The hardest question in the benchmark was also the most revealing:

"All dogs are animals. All animals are purple. Are all dogs purple?"

The correct answer has two parts: logically yes (the argument structure is valid), but the premise is factually false (animals aren't purple). Most AI models give only one half. Axiom gave both — clearly separating the logical structure from the real-world facts.

💡Info

"The conclusion necessarily follows from the two premises, regardless of whether the premises are factually true in the real world — since dogs aren't actually purple."

That single response demonstrates something rare in a compact AI model: the ability to separate logical correctness from factual accuracy. That is structured reasoning, not pattern matching.

Why size isn't the story

The tempting headline is "small AI beats big AI." But that misses the point.

Axiom and the general-purpose model have the same size. The difference isn't scale — it's what the model was trained to prioritize.

General-purpose AI training optimizes for being helpful, fluent, and covering as many topics as possible. It does not specifically train for catching false premises, resisting pressure, or distinguishing logical validity from factual truth.

Axiom was trained on reasoning processes — on the way to an answer, not just the answer itself. That's why it catches what others miss: it was built to verify, not just to respond.

The benchmark result isn't a fluke. It's what happens when you optimize for careful reasoning instead of impressive-sounding answers.

What this means for business use

Most AI progress is measured in raw capability benchmarks and marketing-friendly benchmarks. Axiom optimizes for something different: the moments where the quality of thinking matters more than the speed of response.

In practical terms, this matters whenever AI is used for:

Decision support — you need the AI to flag when a decision is based on flawed assumptions
Analysis and research — you need conclusions that hold up under scrutiny, not just ones that sound right
Compliance and review — you need an AI that will say "actually, the question itself is wrong" instead of confidently producing a polished but incorrect answer
Quality assurance — you need a second opinion that actually verifies, not just agrees

Big general-purpose models will keep getting better at being big general-purpose models. Axiom is a different bet: that a focused model trained specifically to reason carefully, verify, and resist pressure is more valuable than a larger model trained to be impressive.

A note on honesty

Axiom is not perfect.

An earlier, smaller version showed occasional inconsistency on some logic puzzles during testing, and the Moses Illusion slipped through once before the final version resolved it. On look-and-say sequences, the model can occasionally fabricate an answer rather than admit it doesn't know.

Those failure modes are documented. They are being addressed.

But on the capabilities that matter most for the use case — catching flawed assumptions, reasoning through multi-step problems, holding correct answers under pressure, separating logic from fact — Axiom went seven for seven.

The bigger picture

Axiom is part of an ongoing experiment in focused, reasoning-first AI models. The bet is simple: a smaller model that thinks carefully is more useful than a larger model that sounds impressive.

On the question that tested that bet most directly, it did not miss.

📦Install

Axiom is part of an ongoing experiment in reasoning-focused AI models. Get in touch if you'd like to evaluate it for your use case — we're working with selected teams on real-world testing.