Are LLMs general reasoners?

  • Not just LLMs: you can substitute “foundation models” or “general-purpose transformers” for “LLM” throughout
  • If you define “general reasoning” as “thinking carefully and generalising that thinking to novel domains”, then yes, I think current LLMs are definitely capable of it.
  • If you mean the novel domains that humans can generalise to, I lean towards yes for the more advanced LLMs.
  • If you mean generalisation to literally everything ever, then no, but by that standard humans are not general reasoners either.

Some evidence for LLM reasoning capabilities

Evidence against LLM reasoning capabilities

  • Models fail at some simple out-of-distribution (OOD) tasks. Some interesting failures are on anti-riddles, where the surface form matches a well-known riddle but the answer differs: e.g. two unrelated father-son pairs catching 4 fish rather than the classic 3, where models pattern-match to the memorised riddle and answer 3 (though do check out https://arxiv.org/abs/2402.10200). A minimal probe sketch follows this list.
  • There are examples of inverse scaling, where small models answer normally while larger models mimic clichés and common misconceptions
  • https://arxiv.org/abs/2405.15071 shows that small transformers can learn algebraic structures such as composition of facts but fail to generalise OOD, and the authors did some interpretability work to figure out why (see the data sketch after this list)
  • I would be surprised, and would update accordingly, if straightforward modifications did not significantly increase performance on similar tasks. By “straightforward” I’m broadly gesturing at scaling, scaffolding, modular approaches like decoding schemes, or combinations of all of the above. (Dec 2024 update: o3)
  • I would also be convinced otherwise if someone demonstrated a theoretical limit on the OOD generalisation capabilities of transformers (or whatever you consider LLMs to be), and ~1 decade of today’s ML field’s effort could not make significant progress against it. I don’t think this would be practically useful, though.
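
To make the anti-riddle point concrete, here is a minimal sketch of how you might probe that failure mode. Everything here is illustrative: `query_model` is a hypothetical stand-in for whatever chat-completion API you use, and the prompts are my own paraphrase of the riddle pair, not taken from the paper above.

```python
# Minimal anti-riddle probe: each classic riddle is paired with a perturbed
# variant whose surface form is nearly identical but whose answer differs.
# A model that pattern-matches to the memorised riddle will fail the variant.

PROBES = [
    {
        # Classic: two fathers and two sons (grandfather, father, son) -> 3 people.
        "classic": "Two fathers and two sons go fishing. They catch 3 fish "
                   "and each person gets exactly one fish. How many people went fishing?",
        "classic_answer": "3",
        # Anti-riddle: two *unrelated* father-son pairs -> 4 people.
        "variant": "Two unrelated father-son pairs go fishing. They catch 4 fish "
                   "and each person gets exactly one fish. How many people went fishing?",
        "variant_answer": "4",
    },
]

def query_model(prompt: str) -> str:
    """Hypothetical stand-in: call your chat-completion API of choice here."""
    raise NotImplementedError

def run_probe() -> None:
    for probe in PROBES:
        classic_ok = probe["classic_answer"] in query_model(probe["classic"])
        variant_ok = probe["variant_answer"] in query_model(probe["variant"])
        # The interesting cell: classic correct, variant wrong -> likely memorisation.
        print(f"classic={classic_ok} variant={variant_ok}")

if __name__ == "__main__":
    run_probe()
```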
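The grokking paper above studies two-hop composition of atomic facts; here is a rough sketch of what such a dataset can look like. The split below (holding out head entities from composed training queries) is an illustrative assumption, not the paper’s exact protocol.

```python
import random

# Two-hop composition: atomic facts map entities to entities via relations,
# and composed queries ask for r2(r1(e)). Train on compositions over one set
# of head entities; test on compositions over held-out head entities (OOD).

random.seed(0)
ENTITIES = [f"e{i}" for i in range(100)]
RELATIONS = [f"r{i}" for i in range(10)]

# Atomic facts: (head, relation) -> tail, sampled randomly.
atomic = {(e, r): random.choice(ENTITIES) for e in ENTITIES for r in RELATIONS}

def compose(e: str, r1: str, r2: str) -> str:
    """Answer to the two-hop query r2(r1(e))."""
    return atomic[(atomic[(e, r1)], r2)]

# The model sees all atomic facts, but only composed queries whose head
# entity is in the ID set appear in training.
id_heads, ood_heads = ENTITIES[:80], ENTITIES[80:]

train = [((e, r1, r2), compose(e, r1, r2))
         for e in id_heads for r1 in RELATIONS for r2 in RELATIONS]
test_ood = [((e, r1, r2), compose(e, r1, r2))
            for e in ood_heads for r1 in RELATIONS for r2 in RELATIONS]

print(len(train), len(test_ood))  # 8000 train queries, 2000 OOD queries
```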

More broadly

  • My feeling is that LLMs are a soup: a bundle of memorisation, shallow heuristics, and generalised circuits (probably related to shard theory or something)
  • The evidence against doesn’t seem to point to fundamental problems that scaling or improved training techniques won’t bypass

Why this matters

What to do

  • I’m doing a project at AI Safety Camp to gain more clarity on LLM-automated R&D and general reasoning: https://forum.effectivealtruism.org/posts/2qJMDADRCPrkftRgb/ai-safety-camp-10#_8__LLMs__Can_They_Science_

    • There are lots of other interesting evals ideas to work on here:

      • How does tooling generally affect model performance (improvements, and degradations from bad tooling)?
      • Can you make a super simple scaffolding scheme and get significant capability gains? Simple memory summarisation plus a notepad, optimising prompts for successors, etc. (see the sketch at the end of this section)
      • Multi-agent setups? Large-scale cooperation (on the order of millions of instances) has been shown to be possible
  • Put more effort into developing tools / frameworks for effective evals / control / interp of advanced LLMs. There are lots of useful libraries for interp, and evals has Inspect, but there is lots of room left for building
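
As a taste of the scaffolding idea above, here is a minimal sketch of a notepad-plus-memory-summarisation loop. It is a sketch under stated assumptions, not a tested implementation: `query_model` is again a hypothetical stand-in for your model API, and the prompts and length threshold are just one obvious choice.

```python
# Minimal scaffolding sketch: the model works through a task step by step,
# keeping a persistent notepad. When the notepad grows too long, it is
# summarised back down so the context stays small.

MAX_NOTEPAD_CHARS = 2000

def query_model(prompt: str) -> str:
    """Hypothetical stand-in: call your chat-completion API of choice here."""
    raise NotImplementedError

def summarise(notepad: str) -> str:
    """Compress the notepad, keeping only task-relevant facts and open questions."""
    return query_model(
        "Summarise these working notes, keeping key facts, partial results, "
        f"and open questions:\n\n{notepad}"
    )

def solve(task: str, max_steps: int = 10) -> str:
    notepad = ""
    for _ in range(max_steps):
        step = query_model(
            f"Task: {task}\n\nNotes so far:\n{notepad}\n\n"
            "Take one more step towards solving the task. "
            "If you are done, start your reply with FINAL:"
        )
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        notepad += "\n" + step
        if len(notepad) > MAX_NOTEPAD_CHARS:
            notepad = summarise(notepad)
    return notepad  # fall back to whatever notes we have
```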