The author concludes with a theory: OpenAI trains its base models on datasets with more/better chess games than those used for open models, which is why all the open models are terrible at chess. By this theory, recent OpenAI base models would be excellent at chess in completion mode; it is the chat models we actually get access to that aren't. The author encourages further exploration and experimentation to understand this phenomenon better.
Key takeaways:
- Large language models (LLMs) are generally bad at chess, with one exception: `gpt-3.5-turbo-instruct`, which can play at an advanced-amateur level.
- The author suggests four theories for why this might be the case: the size of the base models, how much chess data each model was trained on, the architecture of certain LLMs, and competition between different types of training data.
- Experiments showed that recent chat models can play chess well when prompted correctly. The author also dismisses the theories that OpenAI is cheating or that LLMs can't actually play chess.
- The author's current theory is that OpenAI trains its base models on datasets with more/better chess games than those used by open models, and that recent base OpenAI models would be excellent at chess in completion mode.
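The "completion mode" idea above can be made concrete: instead of chatting, you hand the model the opening of a chess game in PGN notation and let it continue the text. Below is a minimal sketch of building such a prompt; the `pgn_prompt` helper and the header values (player names, Elo ratings) are hypothetical illustrations, not the author's exact setup.

```python
def pgn_prompt(moves, white_elo=2400, black_elo=2400):
    """Build a text-completion prompt: PGN headers plus the movetext so far.

    A completion model continues the text, i.e. emits the next move.
    `moves` is a list of moves in standard algebraic notation (SAN).
    """
    # Hypothetical header values; high Elo headers are a common trick to
    # nudge the model toward strong play.
    headers = (
        '[White "Player A"]\n'
        '[Black "Player B"]\n'
        f'[WhiteElo "{white_elo}"]\n'
        f'[BlackElo "{black_elo}"]\n\n'
    )
    movetext = []
    for i in range(0, len(moves), 2):
        pair = f"{i // 2 + 1}. {moves[i]}"      # White's move
        if i + 1 < len(moves):
            pair += f" {moves[i + 1]}"          # Black's reply, if played
        movetext.append(pair)
    prompt = headers + " ".join(movetext)
    # A trailing space cues the model to continue with the next move.
    return prompt + " " if moves else prompt

print(pgn_prompt(["e4", "e5", "Nf3"]))
```

The resulting string would then be sent to a completions-style endpoint, with the model's first emitted token(s) parsed as its move.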