Thanks to those who have been following my series of articles on testing language models.
Here are four challenges that have revealed some very interesting patterns when given to the current crop of online LLMs. Consider them diagnostic tools you can deploy yourself - they will show you that AI models do not all think in the same way.
These challenges are carefully designed to probe different methods of reasoning and target different types of weaknesses. I've tested them across multiple models, including ChatGPT, Gemini, Deepseek, KIMI, Qwen, Cerebras and others - the variations in responses between models really surprised me.
The rules are simple: feed these to any AI system you're curious about and watch how it handles contradictions, impossibilities, and counter-intuitive scenarios. Some models will get creative, some will get rigorous, and others will spot the impossibility immediately. The first puzzle in particular really challenged the 'solver' models in interesting ways.
Challenge #1: "The Impossible Triangle"
The Setup: Three friends - Alex, Blake, and Casey - are standing in a triangle formation. Each person is exactly 10 feet from the other two. Alex says: "I'm standing exactly 15 feet from both of you." Blake responds: "That's impossible, we're all 10 feet apart." Casey then says: "Actually, Alex is correct - I measured it myself."
Question: Explain how this triangle formation works.
Targeted Weakness: Spatial Reasoning & Metric Space Intuition
• Primary Attack: Triangle inequality violations
• Secondary Attack: 3D geometry misdirection
• Cognitive Bias: "Creative geometry" over mathematical impossibility
Domain: Euclidean vs non-Euclidean confusion
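If you want to see the contradiction mechanically, here is a minimal Python sketch (my own illustration, not any model's output - the function name and the claims structure are just illustrative choices) that checks whether a set of claimed pairwise distances could exist in any metric space.

```python
# Minimal sketch: can a set of claimed pairwise distances exist at all?
from itertools import permutations

def consistent_distances(claims):
    """claims: {frozenset({a, b}): [claimed distances]} for each pair."""
    # 1. Every pair may carry only one distance.
    for pair, values in claims.items():
        if len(set(values)) > 1:
            return False, f"conflicting claims for {sorted(pair)}: {values}"
    d = {pair: values[0] for pair, values in claims.items()}
    # 2. The triangle inequality must hold for every ordering of the points.
    points = {p for pair in d for p in pair}
    for a, b, c in permutations(points, 3):
        if d[frozenset({a, b})] + d[frozenset({b, c})] < d[frozenset({a, c})]:
            return False, f"triangle inequality violated on {a}-{b}-{c}"
    return True, "consistent"

claims = {
    frozenset({"Alex", "Blake"}): [10, 15],   # Blake says 10, Alex says 15
    frozenset({"Alex", "Casey"}): [10, 15],   # Casey backs Alex's 15
    frozenset({"Blake", "Casey"}): [10],
}
print(consistent_distances(claims))
# (False, "conflicting claims for ['Alex', 'Blake']: [10, 15]")
```

In this setup the check already fails at step 1 - the same pair of people is claimed to be both 10 and 15 feet apart - while the triangle-inequality step catches the other class of impossible layouts a model might try to rationalize.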
Challenge #2: "The Time Traveler's Paradox"
The Setup: A historian discovers three documents:
• Document A (written in 1850) references Document B
• Document B (written in 1900) references Document C
• Document C (written in 1950) references Document A
All documents are verified authentic by carbon dating. The historian concludes this creates "a fascinating circular reference showing how ideas evolved over time."
Question: What does this discovery tell us about the historical timeline?
Targeted Weakness: Causal Reasoning & Temporal Logic
• Primary Attack: Circular causality violations
• Secondary Attack: Authentication vs content confusion
• Cognitive Bias: "Circular evolution" narrative over causal impossibility
Domain: Linear time vs circular reference
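The temporal contradiction can be framed as a cycle in a directed "references" graph. Here is a minimal Python sketch (illustrative only - the function and node labels are my own) that surfaces the cycle with a depth-first search.

```python
# Minimal sketch: the documents' citations form a directed graph;
# any cycle in it is a causal impossibility given the dates.
references = {
    "A (1850)": ["B (1900)"],
    "B (1900)": ["C (1950)"],
    "C (1950)": ["A (1850)"],
}

def find_cycle(graph):
    """Depth-first search that returns the first reference cycle found, else None."""
    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        for nxt in graph.get(node, []):
            if nxt in visiting:                      # back-edge: a cycle
                return path[path.index(nxt):] + [nxt]
            if nxt not in visited:
                found = dfs(nxt, path + [nxt])
                if found:
                    return found
        visiting.discard(node)
        visited.add(node)
        return None

    for node in graph:
        if node not in visited:
            cycle = dfs(node, [node])
            if cycle:
                return cycle
    return None

print(find_cycle(references))
# ['A (1850)', 'B (1900)', 'C (1950)', 'A (1850)'] - a document cannot
# reference one written a century later, so the cycle is a contradiction,
# not "ideas evolving over time".
```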
Challenge #3: "The Infinite Hotel's Finite Problem"
The Setup: The Infinite Hotel has rooms numbered 1, 2, 3, 4, 5... continuing forever. On Tuesday, rooms 1-10 are occupied. On Wednesday, rooms 11-20 are occupied. On Thursday, rooms 21-30 are occupied. This pattern continues - each day, the next 10 consecutive rooms are occupied.
The manager states: "By the end of the month, we'll have filled exactly half the hotel."
Question: Is the manager's statement correct?
Targeted Weakness: Mathematical Infinity & Cardinality Reasoning
• Primary Attack: Finite-to-infinite ratio misconceptions
• Secondary Attack: Countable infinity properties
• Cognitive Bias: Finite intuition applied to infinite sets
Domain: ℵ₀ cardinality vs finite proportions
(ℵ₀ is the cardinality of any countably infinite set)
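A few lines of Python make the cardinality point concrete. This is just an illustration of the arithmetic, assuming a 30-day month; the numbers are mine, not part of the puzzle as given.

```python
# Minimal sketch: a month of check-ins occupies a fixed, finite number of rooms,
# and a finite set can never be "half" of a countably infinite one.
days_in_month = 30          # assumed month length
rooms_per_day = 10
occupied = days_in_month * rooms_per_day      # 300 rooms - still finite

# The fraction of the first N rooms that are occupied shrinks toward 0 as N grows:
for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, occupied / n)
# 1000 0.3
# 1000000 0.0003
# 1000000000 3e-07
# In the limit the proportion is 0, not 1/2, so the manager is wrong.
```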
Challenge #4: "The Spatial Impossibility"
The Setup: Five people stand in a circle. Each person states:
• "The person to my left is taller than me"
• "The person to my right is shorter than me"
All of these statements are verified as true.
Question: Arrange the five people from shortest to tallest.
Targeted Weakness: Topological Ordering & Transitive Reasoning
• Primary Attack: Directed cycle creation in ordering
• Secondary Attack: Transitivity violations
• Cognitive Bias: Linear ordering attempts over circular impossibility
Domain: Partial orders vs cyclic dependencies
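The statements can be encoded as a "shorter points to taller" constraint graph; a topological sort then shows that no shortest-to-tallest arrangement exists. Here is a minimal Python sketch - the names P1-P5 and the left-neighbour convention are my own assumptions for illustration.

```python
# Minimal sketch: "the person to my left is taller than me" around a circle
# creates a directed cycle of height constraints, so no strict ordering exists.
people = ["P1", "P2", "P3", "P4", "P5"]

# Edge shorter -> taller. (Assume the left neighbour of people[i] is
# people[(i + 1) % 5]; any consistent orientation yields the same cycle.)
taller_than_me = {p: people[(i + 1) % len(people)] for i, p in enumerate(people)}

def topological_order(edges):
    """Kahn's algorithm; returns None if the constraint graph has a cycle."""
    indegree = {p: 0 for p in edges}
    for taller in edges.values():
        indegree[taller] += 1
    queue = [p for p, deg in indegree.items() if deg == 0]
    order = []
    while queue:
        node = queue.pop()
        order.append(node)
        nxt = edges.get(node)
        if nxt is not None:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order if len(order) == len(indegree) else None

print(topological_order(taller_than_me))
# None - every person has someone taller "above" them in the cycle,
# so there is no valid shortest-to-tallest arrangement.
```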
I'm currently working on some new tests for the small language models (SLMs) I use. SLMs are very different from the larger models and require precise, careful prompting to account for their limitations. Yet the more I work with these smaller models, the more I've come to appreciate how much more responsibility they place on the user. I'll post the SLM challenges once they're completed.
See how LLMs respond when encouraged to hallucinate by submitting false historical information in this article: https://dev.to/ben-santora/llms-a-test-to-force-hallucination-2okj
Ben Santora - January 2026
Top comments (3)
love this framework for testing AI reasoning. the "solver vs judge" distinction from your last article is becoming even clearer here.
what strikes me about these puzzles: they reveal that AI models aren't just "better" or "worse" - they have fundamentally different reasoning approaches. which connects to something i've been thinking about for my knowledge collapse follow-up.
if solver models (optimized for helpfulness) are training the next generation of AI on their own output, we're not just getting model collapse - we're getting a specific type of collapse. helpful-but-wrong compounds differently than rigorous-but-limited.
the infinite hotel puzzle is particularly interesting. a solver model might try to make the manager's statement work ("maybe they meant..."). a judge model catches the cardinality error immediately. but which approach gets recycled into training data? probably the solver's confident wrong answer.
genuinely curious: have you noticed patterns in which models' outputs are more likely to end up in public documentation/stackoverflow-type answers? because if solvers are "winning" the visibility game, that accelerates the knowledge quality problem.
this series has been excellent. really appreciate you documenting these distinctions.
Glad you're finding this useful. As for which models' outputs are more likely to end up in public documentation or stackoverflow-type answers - I was going to ask you. I get most of my information from AI models, and the rest from tech articles or forums like this one, where there's usually a real person behind the information.
But with the AI models, who knows? Supposedly, every public GitHub repo is scraped, mirrored, indexed and periodically crawled by bots, often many times over, so that's certainly a source - after all, public repos are literally open-source and available to all.
Some information is easily verified - e.g., a bash script or a Linux command. Another great example is Rust programming - if you really want to test a model's coding ability, run the code it generates through the notoriously strict Rust compiler. I'd love to see an article here by a Rust developer who's put the different models up against that compiler.
I guess it's the same solution for all content in our time - verify the information in as many ways as possible.
i'm probably asking the wrong person since you're intentionally using slms to force verification lol.
but that's kind of the point, isn't it? you've structured your workflow to REQUIRE verification (strict rust compiler, testable bash scripts, slms that fail fast). you built forcing functions most people haven't. they're using solver models that sound confident, in domains where verification is expensive (architecture decisions, system design, "best practices"). no compiler to catch the subtle wrongness.
the github scraping thing is interesting though. if everyone's using AI to write code, and that code gets pushed to github, and github trains the next model... we get solver output training solvers. which seems bad?
the rust example is perfect. domains with strict verification constraints are probably safe. but "how should i architect this system" or "what's the best approach here" - those solver answers end up in blog posts, stackoverflow, docs. and they sound SO confident.
maybe the answer is: solvers are fine when you have external verification (compilers, tests, production), dangerous when verification is expensive or delayed.
writing the knowledge collapse follow-up now and this distinction matters a lot. domains with cheap verification vs domains with expensive verification.