Yes, these days most LLMs have what amounts to an inner monologue that lets them produce somewhat better answers. You can also decide how much “thinking” they do before giving up.
do not seem to scale well with problem complexity. This is analogous to saying that if an algorithm’s computational cost increases exponentially with problem size, it will never handle large problems. (An example from introductory CS classes is that an optimal comparison sort scales like N log(N), which is quite good scaling.) I say analogous because there isn’t the same theoretical understanding (to put it mildly) of this kind of “reasoning scaling” as there is for the computational cost of algorithms. The paper criticizes the kinds of tests being done in general.
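To make the scaling comparison concrete, here is a rough sketch that tabulates N log(N) growth against exponential growth. The function names and the sample sizes are just my choices for illustration, and the numbers count abstract "steps", not anything measured on a real model.

```python
import math

def n_log_n(n: int) -> float:
    """Step count for an algorithm that scales like N * log2(N)."""
    return n * math.log2(n)

def exponential(n: int) -> int:
    """Step count for an algorithm that scales like 2^N."""
    return 2 ** n

# Sample problem sizes are arbitrary; they only illustrate the trend.
for n in (10, 20, 40, 80):
    print(f"N={n:>3}  N*log2(N)={n_log_n(n):>8.0f}  2^N={exponential(n):.3e}")
```

Doubling N roughly doubles the first column but squares the second, which is why exponential cost rules out large problems.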
Yes. The various problems discussed each have literal algorithmic solutions with varying degrees of complexity.
They also have varying degrees of output length, which is a separate issue.
The Towers of Hanoi problem, for example, is extremely simple, but the number of steps in the solution grows exponentially with the number of disks. The solution is well known, and the LLMs could easily describe the algorithm and code it. What they could not do is execute the algorithm flawlessly. At some point, thousands of steps in, they messed up. Was it because of some statistical error? Probably: the models work in part by intentionally introducing randomness into their output (the amount is a parameter you set). Or maybe it was an unintentional error, i.e. a “hallucination”. Or was it because of context window size (i.e. working memory)? The models only have the working memory of a novella or so before they start glitching.
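For reference, the standard recursive algorithm fits in a few lines, and it takes 2^n - 1 moves for n disks. The sketch below also shows why a tiny per-step error rate is enough to sink a long run. The 0.001 error rate is a made-up number for illustration, not a measured property of any model, and treating each step as an independent chance of error is my simplifying assumption.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the moves (disk, from_peg, to_peg) that solve an n-disk puzzle."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest remaining disk to its destination.
    yield (n, source, target)
    # Move the n-1 smaller disks on top of it.
    yield from hanoi(n - 1, spare, target, source)

def flawless_probability(steps, per_step_error):
    """Chance of zero mistakes over `steps` steps, assuming independent errors."""
    return (1.0 - per_step_error) ** steps

# Hypothetical per-step error rate, chosen only for illustration.
ERROR_RATE = 0.001

for disks in (5, 10, 12):
    steps = 2 ** disks - 1  # same as len(list(hanoi(disks)))
    p = flawless_probability(steps, ERROR_RATE)
    print(f"{disks:>2} disks: {steps:>5} moves, P(no mistakes) = {p:.3f}")
```

The point is that knowing the algorithm and executing it thousands of times without a single slip are different abilities, and the second one degrades quickly as the move count grows.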
The river crossing was a better example. It is complex, and I think the models failed to make much progress. I suspect, from playing with LLMs elsewhere, they weren’t suitably systematic. That’s a real issue, but it’s hard to understand exactly why they are systematic sometimes and not other times. It’s also the kind of issue that can go away when a new model is introduced.
The paper didn’t really dig into the errors that the models were making, so it’s hard to tell what went wrong. They then drew some very far-reaching conclusions about the models not really “thinking” or not “scaling”, but that’s different from being bad at Towers of Hanoi.
Finally, they note that the model “overthinks” the Towers of Hanoi problem, suggesting that there are some problems where no amount of thinking helps you. This is not really a surprise, though. If you’ve ever actually done a Towers of Hanoi puzzle, you know it doesn’t help to “think” about it. It’s a mindless task.