The study lends support to the widespread belief that GPT-4's performance has declined over the past few months, a claim OpenAI has consistently denied. Some experts argue that the findings do not conclusively prove a decline and could instead reflect fine-tuning adjustments made by OpenAI. For instance, the study was criticized for checking whether generated code was immediately executable rather than whether it was correct.
Key takeaways:
- A research paper from Stanford University and the University of California, Berkeley suggests that the AI language model GPT-4 has become less effective at coding and compositional tasks over time.
- The researchers tested the March and June 2023 versions of GPT-4 and GPT-3.5 on tasks like math problem-solving, answering sensitive questions, code generation, and visual reasoning. They found that GPT-4's ability to identify prime numbers dropped from 97.6 percent accuracy in March to just 2.4 percent in June.
- OpenAI has denied claims that GPT-4's capabilities have decreased, with VP of Product Peter Welinder stating that each new version is smarter than the previous one.
- Princeton computer science professor Arvind Narayanan criticized the study's methodology, arguing that it did not evaluate the correctness of the code GPT-4 generated, only whether the raw output could be executed as-is (a sketch of this distinction appears below).
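
Narayanan's point is easier to see with a concrete example. The sketch below is a minimal illustration in Python; the function names, the toy `add` task, and the fence-stripping logic are assumptions for illustration, not the paper's actual evaluation harness. It contrasts a strict "is the raw output directly executable?" check with a looser "is the code it contains correct?" check: a response that wraps a perfectly working function in markdown code fences fails the first check even though it passes the second.

```python
# Hypothetical scoring helpers contrasting two ways to grade a model's code answer.

def is_directly_executable(response: str) -> bool:
    """Strict check: treat the raw model output as a script and try to run it."""
    try:
        exec(compile(response, "<model-output>", "exec"), {})
        return True
    except Exception:
        return False

def extract_code(response: str) -> str:
    """Strip markdown fences such as ```python ... ``` before executing."""
    text = response.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the opening fence (with optional language tag) and any closing fence.
        body = [ln for ln in lines[1:] if not ln.strip().startswith("```")]
        return "\n".join(body)
    return text

def is_correct(response: str) -> bool:
    """Looser check: extract the code, run it, and verify its behavior on a toy task."""
    namespace = {}
    try:
        exec(compile(extract_code(response), "<model-output>", "exec"), namespace)
        return namespace.get("add")(2, 3) == 5  # toy functional test for an 'add' task
    except Exception:
        return False

# A correct answer wrapped in markdown fails the strict check but passes the loose one.
fenced_answer = "```python\ndef add(a, b):\n    return a + b\n```"
print(is_directly_executable(fenced_answer))  # False: the backticks are a SyntaxError
print(is_correct(fenced_answer))              # True: the code inside works
```

Under the strict criterion, a formatting change of this kind registers as a failure regardless of whether the code itself is right, which is the gap the criticism points to.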