The article also provides a guide to reproducing the experiment with various Python scripts. The experiment involves downloading puzzles and games from Lichess, converting them to FEN (Forsyth-Edwards Notation), generating proof games, and comparing the model's performance across the two move histories. The author suggests potential next steps, including logging the model's rate of illegal moves, trying other models, and testing other sources of spurious features. The project has minimal dependencies and requires an OpenAI API key. It is licensed under GPL v3.
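To make the FEN step concrete, here is a minimal sketch of what a FEN string encodes. This parser is an illustrative helper written for this summary, not one of the project's actual scripts; it splits a FEN into its six standard fields and sanity-checks that each rank of the piece placement accounts for eight squares.

```python
# Illustrative only: a tiny FEN (Forsyth-Edwards Notation) parser.
# FEN packs a chess position into six space-separated fields:
# piece placement, side to move, castling rights, en-passant square,
# halfmove clock, and fullmove number.

def parse_fen(fen):
    placement, side, castling, ep, halfmove, fullmove = fen.split()
    for rank in placement.split("/"):
        # Digits stand for runs of empty squares; letters are pieces.
        squares = sum(int(c) if c.isdigit() else 1 for c in rank)
        assert squares == 8, f"malformed rank: {rank}"
    return {
        "placement": placement,
        "side_to_move": side,
        "castling": castling,
        "en_passant": ep,
        "halfmove_clock": int(halfmove),
        "fullmove_number": int(fullmove),
    }

# The standard starting position.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
info = parse_fen(start)
print(info["side_to_move"], info["fullmove_number"])  # -> w 1
```

Note that a FEN records only the current position plus a few counters, not the moves that produced it, which is exactly why two different game histories can map to the same FEN.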
Key takeaways:
- The ChessLLM project tests how sensitive chess-playing language models, such as GPT-3.5-turbo-instruct, are to irrelevant factors beyond the position on the board.
- The model's choice of move in a given position can vary with irrelevant context, such as the sequence of moves that led to that position.
- A tool called proofgame constructs pairs of move sequences that reach the same position; the model's puzzle performance is then measured and reported twice, once with the original game history and once with the constructed one.
- The project is a work in progress and future steps include logging the model's rate of illegal moves, trying other models, testing other sources of spurious features, and figuring out the implications of these experiments.
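The core idea behind the position pairs can be demonstrated without any chess library. The sketch below (illustrative only, not the project's proofgame tool) represents the board as a dict of square → piece and applies moves as (from, to) pairs: two different move orders that transpose into the same position produce identical board states, and it is exactly such pairs that the experiment feeds to the model.

```python
# Illustrative sketch: show that two different move orders can
# transpose into the identical position. Board = dict square -> piece;
# a move just relocates whatever sits on the source square
# (no legality checking -- this is a demonstration, not an engine).

def start_board():
    board = {}
    back_rank = "RNBQKBNR"
    for i, piece in enumerate(back_rank):
        file = "abcdefgh"[i]
        board[file + "1"] = "w" + piece   # white back rank
        board[file + "2"] = "wP"          # white pawns
        board[file + "8"] = "b" + piece   # black back rank
        board[file + "7"] = "bP"          # black pawns
    return board

def apply_moves(moves):
    board = start_board()
    for src, dst in moves:
        board[dst] = board.pop(src)
    return board

# 1.e4 Nc6 2.Nf3 e5  vs  1.Nf3 Nc6 2.e4 e5 -- different histories,
# same final position.
line_a = [("e2", "e4"), ("b8", "c6"), ("g1", "f3"), ("e7", "e5")]
line_b = [("g1", "f3"), ("b8", "c6"), ("e2", "e4"), ("e7", "e5")]

assert apply_moves(line_a) == apply_moves(line_b)
print("transposition confirmed: identical positions")
```

A model that only cared about the position should play identically after either history; any systematic difference is evidence of sensitivity to the spurious move-order feature.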