The study also found that ChatGPT stopped explaining its reasoning process, a behavior that had been present in March but had disappeared by June. This loss of transparency was also observed when the chatbot was asked to answer sensitive questions. The researchers concluded that while the technology may have become safer, it also provided less rationale for its answers. Given these unpredictable changes, they emphasized the importance of continuously monitoring the models' performance over time.
Key takeaways:
- A Stanford University study found that the June version of the high-profile A.I. chatbot ChatGPT performed worse on certain tasks than the March version.
- The study found wild fluctuations, known as drift, in the technology's ability to perform certain tasks, with the most notable results involving GPT-4's ability to solve math problems.
- James Zou, a Stanford computer science professor and one of the study's authors, noted that changing one part of the model can have unpredictable effects on other parts, and stressed the need to continuously monitor the models' performance over time.
- ChatGPT also stopped explaining its reasoning when answering sensitive questions, making the technology less transparent, according to the researchers.