The author also noticed a peculiar tokenization issue: the models performed worse when the prompt ended with a trailing space. This was attributed to the way the Llama tokenizer splits a string of moves, encoding a space followed by a letter as a single token, so a trailing space strands the model in an unusual token state. The author suggests "token healing" as a potential solution, but instead opted to modify the grammar so the model may generate a leading space or not.
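A toy sketch can make the issue concrete. The vocabulary and greedy longest-match tokenizer below are purely illustrative (this is not the real Llama tokenizer), but they show how a trailing space changes the token sequence the model must continue from:

```python
# Toy greedy longest-match tokenizer over a hypothetical vocabulary.
# NOT the real Llama tokenizer -- just an illustration of why a trailing
# space can change the token sequence the model is asked to continue.
VOCAB = {"1.", " e", "4", " "}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i : i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

# Without a trailing space, the prompt ends on a complete move and the
# model can emit the common space-plus-letter token next.  With a trailing
# space, the space is consumed as a standalone token, which is rare in
# training data -- the model never saw moves split that way.
print(tokenize("1. e4"))   # ['1.', ' e', '4']
print(tokenize("1. e4 "))  # ['1.', ' e', '4', ' ']
```

"Token healing" would back up over that final space token and let the model regenerate it as part of a merged token; the grammar modification described above sidesteps the problem instead by making the space optional.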
Key takeaways:
- The author conducted an experiment to test the ability of various large language models (LLMs) to play chess by predicting the next move in a partially played game.
- Most of the models tested, including Llama-3.2-3b, Llama-3.1-70b, Qwen-2.5-72b, and others, performed poorly, losing every game against a standard chess AI.
- The only model that performed well was `gpt-3.5-turbo-instruct`, an OpenAI model, which won every game even against a higher difficulty level of the chess AI.
- The author proposed several theories to explain the results, including the possibility that base models at sufficient scale can play chess but instruction tuning destroys this ability, or that `gpt-3.5-turbo-instruct` was simply trained on more chess games.