# Not All AI Skeptics Think Alike

**Author:** Philipp D. Dubach | **Published:** June 12, 2025 | **Updated:** February 23, 2026
**Categories:** AI
**Keywords:** Apple Illusion of Thinking, AI reasoning models limitations, do AI models actually reason, LLM reasoning collapse, AI reasoning evaluation methodology

## Key Takeaways

- Apple's paper found three performance regimes: standard models beat reasoning models on simple tasks, reasoning models win at medium complexity, and all models collapse at high complexity.
- Reasoning models counterintuitively reduce their thinking effort as problems approach the collapse threshold, even with token budget remaining.
- Providing explicit solution algorithms did not improve performance: the collapse occurred at roughly the same complexity threshold regardless of guidance.
- Critics argue the Tower of Hanoi tests measure algorithm-following, not reasoning, and that models strategically refuse hundreds of sequential steps rather than failing to think.

---

Apple's recent paper "The Illusion of Thinking" has been widely understood to demonstrate that reasoning models don't 'actually' reason. Using controllable puzzle environments instead of contaminated math benchmarks, the researchers discovered something fascinating: there are three distinct performance regimes when it comes to AI reasoning complexity. For simple problems, standard models actually outperform reasoning models while being more token-efficient. At medium complexity, reasoning models show their advantage. But at high complexity? Both collapse completely.

Here's the kicker: reasoning models exhibit counterintuitive scaling behavior. Their thinking effort increases with problem complexity up to a point, then declines even though they have adequate token budget left. It's like watching a student give up mid-exam when the questions get too hard, even though they have plenty of time left.

>We observe that reasoning models initially increase their thinking tokens proportionally with problem complexity. However, upon approaching a critical threshold—which closely corresponds to their accuracy collapse point—models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty.

The researchers found something even more surprising: even when they provided explicit solution algorithms, essentially handing the models the procedure to follow, performance didn't improve. The collapse happened at roughly the same complexity threshold.


[Sean Goedecke](https://www.seangoedecke.com/illusion-of-thinking/), on the other hand, is not buying Apple's methodology. His core objection? Puzzles "require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems."

>You can't compare eight-disk to ten-disk Tower of Hanoi, because you're comparing "can the model work through the algorithm" to "can the model invent a solution that avoids having to work through the algorithm".

From his own testing, models "decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start." That's strategic behavior, not a reasoning failure. This matters because it shows how evaluation methodology shapes our understanding of AI capabilities. Goedecke argues that Tower of Hanoi puzzles aren't useful for determining reasoning ability, and that the complexity threshold of reasoning models may not be fixed.
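
To put numbers on the eight-disk versus ten-disk comparison, here is a minimal Python sketch of the textbook recursive Tower of Hanoi solution. It is not taken from Apple's paper or from Goedecke's post, and it may well differ from the algorithm the researchers supplied in their prompts; the names `hanoi_moves` and `move_count` are made up for illustration. The only fact it relies on is that the shortest solution for n disks takes 2^n - 1 moves.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) for the classic recursive solution.

    Hypothetical helper for illustration; peg labels are arbitrary.
    """
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest disk directly to the target peg.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, spare, target, source)


def move_count(n):
    """Closed-form length of the shortest solution for n disks."""
    return 2 ** n - 1


for disks in (3, 8, 10):
    generated = sum(1 for _ in hanoi_moves(disks))
    # The recursive solver should always match the closed form.
    assert generated == move_count(disks)
    print(f"{disks} disks: {generated} moves")
```

Eight disks already take 255 moves and ten take 1,023, so every two extra disks roughly quadruple the length of the move list a model would have to write out.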


---

## Frequently Asked Questions

### What did Apple's "Illusion of Thinking" paper find about AI reasoning?

Apple's paper identified three distinct performance regimes for AI reasoning models: standard models outperform reasoning models on simple problems, reasoning models show their advantage at medium complexity, and all models collapse completely at high complexity. The paper also found that reasoning models counterintuitively reduce their thinking effort as problems approach the collapse threshold, even when they have token budget remaining.

### Why do AI reasoning models collapse at high complexity?

According to the research, reasoning models initially increase their thinking tokens proportionally with problem complexity but then begin reducing their reasoning effort near a critical threshold. Even when given explicit solution algorithms, performance did not improve, and the collapse occurred at roughly the same complexity threshold regardless of available guidance.

### Is the Tower of Hanoi a valid test of AI reasoning ability?

This is disputed. Critics like Sean Goedecke argue that puzzle-based tests like Tower of Hanoi require computer-like algorithm-following rather than the kind of reasoning needed to solve math problems. He contends that models decide early on that hundreds of algorithmic steps are too many to attempt, which reflects strategic behavior rather than a reasoning failure.

### Do AI reasoning models actually reason or just pattern match?

The answer depends on how you define reasoning and how you test for it. Apple's research suggests models fail on algorithmic puzzle tasks once complexity passes a threshold, but critics argue the evaluation methodology shapes the conclusions. Models may perform differently on mathematical reasoning than on sequential puzzle-solving, making the choice of benchmark a deciding factor in whether AI appears capable of reasoning.


---

*Philipp D. Dubach — [http://philippdubach.com/](http://philippdubach.com/) — 2025*