Apple’s recent paper “The Illusion of Thinking” has been widely understood to demonstrate that reasoning models don’t ‘actually’ reason. Using controllable puzzle environments instead of contaminated math benchmarks, the researchers found something fascinating: three distinct performance regimes as problem complexity grows. For simple problems, standard models actually outperform reasoning models while being more token-efficient. At medium complexity, reasoning models show their advantage. But at high complexity? Both collapse completely. Here’s the kicker: reasoning models scale counterintuitively, increasing their thinking effort with problem complexity up to a point and then pulling back despite an adequate token budget. It’s like watching a student give up mid-exam when the questions get too hard, even though they have plenty of time left.

We observe that reasoning models initially increase their thinking tokens proportionally with problem complexity. However, upon approaching a critical threshold—which closely corresponds to their accuracy collapse point—models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty.

The researchers found something even more surprising: even when they provided an explicit solution algorithm, essentially handing the models the procedure to follow, performance didn’t improve. The collapse happened at roughly the same complexity threshold.

Sean Goedecke, on the other hand, isn’t buying Apple’s methodology. His core objection? Puzzles “require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems.”

You can’t compare eight-disk to ten-disk Tower of Hanoi, because you’re comparing “can the model work through the algorithm” to “can the model invent a solution that avoids having to work through the algorithm”.
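
To put the eight-versus-ten-disk comparison in concrete terms: the optimal Tower of Hanoi solution takes 2^n - 1 moves, so eight disks require 255 moves while ten require 1,023, a far longer transcript for a model to execute without a single slip. Here is a minimal sketch of the standard recursive algorithm (my own illustration, not code from the Apple paper or from Goedecke’s post):

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Standard recursive Tower of Hanoi: list every move for n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # clear n-1 disks onto the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top
    return moves

# The move count grows as 2**n - 1, which is what makes the jump
# from eight to ten disks so punishing for step-by-step execution:
print(len(hanoi(8)))   # 255
print(len(hanoi(10)))  # 1023
```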

In his own testing, models “decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start.” That’s strategic behavior, not reasoning failure. This matters because it shows how evaluation methodology shapes our understanding of AI capabilities. Goedecke argues Tower of Hanoi puzzles aren’t useful for determining reasoning ability, and that the complexity threshold of reasoning models may not be fixed.