The author concludes with a theory: OpenAI trains its base models on datasets with more/better chess games than those used for open models, which is why all the open models are terrible at chess. By this theory, recent OpenAI base models would be excellent at chess in completion mode; it is the chat models we actually get access to that aren't. The author encourages further exploration and experimentation to understand this phenomenon better.
Key takeaways:
- Large language models (LLMs) are generally bad at chess, with one exception: `gpt-3.5-turbo-instruct`, which can play at an advanced-amateur level.
- The author suggests four theories for why this might be the case: the size of the base models, how much chess data each model was trained on, the architecture of certain LLMs, and competition between different types of training data.
- Experiments showed that recent chat models can play chess well when prompted correctly. The author also dismisses the theories that OpenAI is cheating or that LLMs can't actually play chess.
- The author's current theory is that OpenAI trains its base models on datasets with more/better chess games than those used by open models, and that recent base OpenAI models would be excellent at chess in completion mode.
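The "completion mode" idea above can be made concrete: instead of chatting, you hand the model the opening of a chess game in PGN notation and let it continue the text. Below is a minimal sketch of building such a prompt; the `pgn_prompt` helper and the header values (player names, Elo ratings) are hypothetical illustrations, not the author's exact setup.

```python
def pgn_prompt(moves, white_elo=2400, black_elo=2400):
    """Build a text-completion prompt: PGN headers plus the movetext so far.

    A completion model continues the text, i.e. emits the next move.
    `moves` is a list of moves in standard algebraic notation (SAN).
    """
    # Hypothetical header values; high Elo headers are a common trick to
    # nudge the model toward strong play.
    headers = (
        '[White "Player A"]\n'
        '[Black "Player B"]\n'
        f'[WhiteElo "{white_elo}"]\n'
        f'[BlackElo "{black_elo}"]\n\n'
    )
    movetext = []
    for i in range(0, len(moves), 2):
        pair = f"{i // 2 + 1}. {moves[i]}"      # White's move
        if i + 1 < len(moves):
            pair += f" {moves[i + 1]}"          # Black's reply, if played
        movetext.append(pair)
    prompt = headers + " ".join(movetext)
    # A trailing space cues the model to continue with the next move.
    return prompt + " " if moves else prompt

print(pgn_prompt(["e4", "e5", "Nf3"]))
```

The resulting string would then be sent to a completions-style endpoint, with the model's first emitted token(s) parsed as its move.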