The author concludes that little has changed in the model's ability to solve complex problems, suggesting that LLMs may have reached a plateau. The author also notes that ChatGPT lacks debugging skills and shows no sign of a "world model", i.e. a logical understanding of what it is doing when it programs. The author speculates that the model's performance could be due to overfitting, with the model having simply memorized answers to a battery of standardized tests, and suggests that OpenAI might keep a secret test dataset to avoid training-set contamination.
Key takeaways:
- The author conducted an experiment to test how well ChatGPT-4 performs on Advent of Code 2023, an annual programming event whose puzzles unlock one day at a time (a sketch of how such a test could be scripted appears after these takeaways).
- ChatGPT-4 performed slightly worse than GPT-3.5 had the previous year, fully solving only 2 days compared to GPT-3.5's 3. However, ChatGPT Plus did slightly better, solving 4 days on its own.
- The author suggests that ChatGPT's inability to debug its flawed solutions indicates that it doesn't have a "world model" or a logical understanding of what it's doing when it's programming.
- The author speculates that GPT-4's lack of improvement on Advent of Code could be due to overfitting: the model may simply have memorized the answers to a bunch of standardized tests.
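
For context, here is a minimal sketch of how such a test could be scripted against the OpenAI API. The author worked through the ChatGPT interface by hand, so the model name, prompt wording, and file layout below are illustrative assumptions, not the author's actual setup:

```python
# Hypothetical sketch: feed one Advent of Code puzzle statement to GPT-4
# and capture the program it proposes. Assumes the `openai` Python package
# (v1+) and a puzzle statement saved to day01.txt by hand.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt4_for_solution(puzzle_text: str) -> str:
    """Send one puzzle statement to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name for illustration
        messages=[
            {"role": "system",
             "content": "You are a programmer solving Advent of Code puzzles."},
            {"role": "user",
             "content": f"Write a Python program that solves this puzzle:\n\n{puzzle_text}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    with open("day01.txt") as f:
        print(ask_gpt4_for_solution(f.read()))
```

Grading would still be manual in this sketch: run the returned program against the real puzzle input and check the answer on the Advent of Code site, which is essentially what the author did with the ChatGPT interface.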