
Performance of LLMs on Advent of Code 2024

Dec 30, 2024 - jerpint.io
The article examines how large language models (LLMs) performed on the Advent of Code 2024 challenge. The author tested GPT-4o, Gemini-1.5-pro, and Claude-3-5-sonnet-20241022, giving each model the full problem description and asking it to generate a Python script that solves the puzzle. Notably, the author, solving the problems by hand without LLM assistance, scored better than any of the models. The results suggest that LLMs struggle with never-before-seen problems and tend to rely on templates for familiar task shapes. Their submissions frequently failed with timeout errors and unhandled exceptions, suggesting that human intervention or a more advanced setup might improve their performance.

The study suggests that while LLMs excel at well-known problem types, they falter on novel challenges, pointing to a concrete area for improvement in automated coding. The author notes that the models would benefit from generating more efficient solutions and from access to a code interpreter for testing and debugging. The experiment was conducted on December 26th, so the models could not have been trained on the challenge's solutions. The author anticipates that LLM performance on future Advent of Code challenges will improve as models evolve and are likely trained on past submissions.

Key takeaways:

  • LLMs did not perform as well as expected on the Advent of Code 2024 challenge, especially on never-before-seen problems.
  • The models were given both parts of each problem at once, an advantage over the usual flow where part 2 is only revealed after part 1 is solved, yet they still underperformed the author's human attempts.
  • Timeout errors and exceptions were common in the models' submissions, indicating a need for more efficient solutions and potential human intervention for debugging.
  • The experiment suggests that LLMs are better at using templates for known problems rather than solving new, unseen challenges, highlighting a potential area for improvement in coding agents.
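The article does not include the author's evaluation harness, but the setup it describes (run each model-generated Python script, enforce a time limit, and record timeouts and exceptions) can be sketched as follows. The function name `run_candidate` and the specific timeout value are illustrative assumptions, not details from the original experiment:

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(script: str, stdin_text: str = "", timeout_s: float = 5.0):
    """Run a model-generated Python script in a subprocess.

    Returns (status, output) where status is one of:
      'ok'      - script exited cleanly; output is its stdout
      'error'   - script raised or exited non-zero; output is its stderr
      'timeout' - script exceeded the time limit
    The names and timeout here are hypothetical, not from the article.
    """
    # Write the candidate script to a temporary file so it runs in isolation.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            input=stdin_text,       # puzzle input piped via stdin
            capture_output=True,
            text=True,
            timeout=timeout_s,      # inefficient solutions trip this
        )
        if proc.returncode != 0:
            return "error", proc.stderr.strip()
        return "ok", proc.stdout.strip()
    except subprocess.TimeoutExpired:
        return "timeout", ""
    finally:
        os.unlink(path)
```

A harness like this makes the two failure modes from the article directly observable: an infinite loop or brute-force solution surfaces as `'timeout'`, and an unhandled exception surfaces as `'error'` with the traceback available for (human) debugging.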