ChatGPT appears to be getting worse over time. That's the takeaway from a recent study by researchers from Stanford University and UC Berkeley comparing the March and June versions of GPT-3.5 and GPT-4. The problem? On several of the tasks the researchers asked the AI to perform, the older versions of the models outperformed the newer ones.
The researchers tested both models on several tasks, including solving math problems, answering certain tricky questions, and generating directly executable code. The differences were dramatic: the share of GPT-4's code generations that were directly executable dropped from 52% in March to 10% in June.
Notably, GPT-4's accuracy at identifying prime numbers plummeted from nearly perfect in March to just 2.4 percent in June. GPT-3.5 also appears to have degraded over the same period: the researchers found that its June version answered fewer of their tricky questions than its March version did.
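For context, the prime-identification task has an easily computed ground truth, which is part of what makes the drop so striking. A minimal trial-division check like the sketch below (a generic illustration, not the researchers' actual evaluation harness, and 17077 is just an example number) is all it takes to score a model's yes/no answers:

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality test via simple trial division."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True

# Scoring a model's reply to a prompt like "Is 17077 a prime number?"
model_said_prime = True  # hypothetical answer parsed from the model's reply
print(model_said_prime == is_prime(17077))  # True: the answer is correct
```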
Still, while the researchers' findings may look conclusive, several computer scientists caution against jumping to hasty conclusions. Princeton computer science professor Arvind Narayanan notes that the researchers haven't proved beyond doubt that GPT-4's performance declined. For example, he argues they should have tested whether the generated code was correct, rather than merely whether it could be executed directly.
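Narayanan's distinction is easy to illustrate: a snippet can run without raising an error, and so count as "directly executable," while still producing the wrong answer. Here is a hypothetical example of such a snippet (invented for illustration, not drawn from the study):

```python
# A "directly executable" snippet: it runs without raising an error,
# yet the result is wrong.
def sum_of_evens(numbers):
    # Bug: this sums the odd numbers instead of the even ones.
    return sum(n for n in numbers if n % 2 == 1)

result = sum_of_evens([1, 2, 3, 4])
print(result)  # Prints 4 (1 + 3); the correct answer is 6 (2 + 4).

# Executability check (what the study measured): did it run? Yes.
# Correctness check (what Narayanan suggests): is the output right? No.
```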
AI researcher Simon Willison, speaking to Ars Technica, was similarly skeptical. "It looks to me like they ran temperature 0.1 for everything," he said. "It makes the results slightly more deterministic, but very few real-world prompts are run at that temperature, so I don't think it tells us much about real-world use cases for the models."
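For reference, temperature is a standard sampling parameter exposed by the OpenAI API: lower values make outputs more deterministic, higher values more varied. A minimal sketch of how a benchmark run might pin it at 0.1, using the current openai Python client (the model name and prompt here are placeholders, not the study's actual setup):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Temperature 0.1 makes sampling nearly deterministic, which is convenient
# for benchmarking but, as Willison notes, unrepresentative of how most
# real-world prompts are run.
response = client.chat.completions.create(
    model="gpt-4",  # placeholder; the study compared dated model snapshots
    messages=[{"role": "user", "content": "Is 17077 a prime number?"}],
    temperature=0.1,
)
print(response.choices[0].message.content)
```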
OpenAI is well aware of the study's findings, as head of developer relations Logan Kilpatrick acknowledged on Twitter recently. Time will tell, but one thing is for sure: if AI performance really is declining, we shouldn't expect humans to be doing much better (especially lawyers).