The author also noticed a peculiar tokenization issue: the models performed worse when the prompt ended with a trailing space. This was attributed to the way the Llama tokenizer splits a string of moves, encoding a space followed by a letter as a single token, so a trailing space strands the model in an unusual token state. The author suggests "token healing" as a potential solution, but instead opted to modify the grammar so the model may generate a leading space or not.
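A toy sketch can make the issue concrete. The vocabulary and greedy longest-match tokenizer below are purely illustrative (this is not the real Llama tokenizer), but they show how a trailing space changes the token sequence the model must continue from:

```python
# Toy greedy longest-match tokenizer over a hypothetical vocabulary.
# NOT the real Llama tokenizer -- just an illustration of why a trailing
# space can change the token sequence the model is asked to continue.
VOCAB = {"1.", " e", "4", " "}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i : i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

# Without a trailing space, the prompt ends on a complete move and the
# model can emit the common space-plus-letter token next.  With a trailing
# space, the space is consumed as a standalone token, which is rare in
# training data -- the model never saw moves split that way.
print(tokenize("1. e4"))   # ['1.', ' e', '4']
print(tokenize("1. e4 "))  # ['1.', ' e', '4', ' ']
```

"Token healing" would back up over that final space token and let the model regenerate it as part of a merged token; the grammar modification described above sidesteps the problem instead by making the space optional.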
Key takeaways:
- The author conducted an experiment to test the ability of various large language models (LLMs) to play chess by predicting the next move in a partially played game.
- Most of the models tested, including Llama-3.2-3b, Llama-3.1-70b, Qwen-2.5-72b, and others, performed poorly, losing every game against a standard chess AI.
- The only model that performed well was `gpt-3.5-turbo-instruct`, an OpenAI model, which won every game even against a higher difficulty level of the chess AI.
- The author proposed several theories to explain the results, including the possibility that base models at sufficient scale can play chess but instruction tuning destroys this ability, or that `gpt-3.5-turbo-instruct` was simply trained on more chess games.