The Stanford researchers used the example of three-digit addition, where LLMs were previously judged only on all-or-nothing accuracy (the answer counts only if every digit is right), making the ability to add appear suddenly at a certain scale threshold. They retested the task using a metric that awards partial credit, showing that the ability to add is not emergent, but gradual and predictable. They argue that the improvement in LLMs as they scale up reflects the added capacity of larger models, not sudden, unpredictable jumps in ability.
Key takeaways:
- The Beyond the Imitation Game benchmark project (BIG-bench) compiled a list of tasks to test the capabilities of large language models (LLMs), and found that performance improved as the models scaled up, but on some tasks the improvement wasn't smooth.
- Researchers have described the sudden improvement in performance as 'emergent' behavior, likening it to a phase transition in physics, but a new paper by Stanford University researchers argues that the apparent jump is a consequence of how performance is measured.
- The Stanford researchers argue that the abilities of LLMs are neither unpredictable nor sudden, and that the perceived 'emergence' of abilities has more to do with the choice of measurement than with the model's inner workings.
- The Stanford team retested the LLMs using a metric that awards partial credit, showing that as the parameter count increased, the models predicted more and more digits of the answer correctly in addition problems, suggesting that the ability to add isn't emergent but gradual and predictable (see the sketch below).
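To make the contrast concrete, here is a minimal Python sketch of the two scoring approaches, assuming a simple per-digit partial-credit score; the paper's own choice of continuous metric may differ, and the function names and model outputs below are invented purely for illustration.

```python
# Illustrative sketch (not the Stanford paper's code): contrasts an
# all-or-nothing exact-match metric with a partial-credit metric that
# scores the fraction of digits predicted correctly in three-digit addition.

def exact_match(predicted: str, target: str) -> float:
    """All-or-nothing: 1.0 only if the full answer string matches."""
    return 1.0 if predicted == target else 0.0

def digit_partial_credit(predicted: str, target: str) -> float:
    """Partial credit: fraction of answer digits predicted correctly,
    compared position by position (right-aligned so units digits line up)."""
    padded = predicted.zfill(len(target))
    correct = sum(1 for a, b in zip(reversed(padded), reversed(target)) if a == b)
    return correct / len(target)

# Hypothetical outputs for "123 + 456 = ?" as a model scales up: a small
# model gets no digits right, a mid-size model gets two of three, and a
# large model gets all three.
target = "579"
predictions_by_scale = {"small": "442", "medium": "578", "large": "579"}

for scale, pred in predictions_by_scale.items():
    print(f"{scale:>6}: exact={exact_match(pred, target):.2f}  "
          f"partial={digit_partial_credit(pred, target):.2f}")
```

Under exact match the scores read 0.0, 0.0, 1.0, an apparent sudden jump; under partial credit they read 0.00, 0.67, 1.00, a gradual improvement. The same underlying predictions look "emergent" or smooth depending only on the metric.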