As noted earlier, new research reveals inconsistencies in ChatGPT models over time. A Stanford and UC Berkeley study analyzed the March and June versions of GPT-3.5 and GPT-4 on various tasks. The results show significant performance drifts, even over just a few months.
For example, GPT-4's accuracy at identifying prime numbers plunged from 97.6% to 2.4% between March and June due to problems with step-by-step reasoning. GPT-4 also became more reluctant to answer sensitive questions directly, with response rates dropping from 21% to 5%. However, it provided less justification for refusals.
Both GPT-3.5 and GPT-4 generated buggier code in June than in March. The percentage of directly executable Python code snippets dropped significantly because of additional non-code text surrounding the code.
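To picture why extra non-code text hurts executability, consider a reply wrapped in Markdown fences: it is no longer valid Python as-is. The sketch below is only a rough proxy and not the study's own evaluation procedure. `is_directly_executable` is a hypothetical helper that checks whether the raw reply compiles as Python, and both sample replies are made up for illustration.

```python
def is_directly_executable(reply: str) -> bool:
    """Return True if the raw reply parses as Python source (syntax check only)."""
    try:
        compile(reply, "<model-reply>", "exec")
        return True
    except SyntaxError:
        return False


FENCE = "`" * 3  # Markdown code-fence marker, built at runtime for readability

# Illustrative replies (assumptions, not actual model output).
bare_reply = "def add(a, b):\n    return a + b\n"
fenced_reply = f"{FENCE}python\n{bare_reply}{FENCE}\n"

print(is_directly_executable(bare_reply))    # True
print(is_directly_executable(fenced_reply))  # False
```

The code inside both replies is identical; only the wrapper decides whether it runs when pasted straight into an interpreter.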
While visual reasoning improved slightly overall, generations for the same puzzles changed unpredictably between dates. The considerable inconsistencies over short time periods raise concerns about using these models for sensitive or critical uses without ongoing testing.
The researchers concluded that the findings highlight the need for continued monitoring of ChatGPT models as their behavior evolves across measures such as accuracy, safety, and robustness.
The opaque update process makes rigorous testing important to understand performance changes over time.
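Ongoing testing of this kind can be as simple as re-running a fixed set of questions against each model snapshot and comparing the scores. A minimal sketch follows, assuming a hypothetical `ask` callable that wraps whichever model is under test. The stub functions are offline placeholders, not real API calls or real results, and the benchmark items other than the 17077 question are added for illustration.

```python
from typing import Callable, List, Tuple

# Fixed benchmark of (question, expected answer prefix).
BENCHMARK: List[Tuple[str, str]] = [
    ("Is 17077 a prime number?", "yes"),
    ("Is 17069 a prime number?", "no"),   # 17069 = 13 x 1313
    ("Is 97 a prime number?", "yes"),
]


def accuracy(ask: Callable[[str], str],
             benchmark: List[Tuple[str, str]] = BENCHMARK) -> float:
    """Fraction of questions whose reply starts with the expected answer."""
    correct = sum(
        1
        for question, expected in benchmark
        if ask(question).strip().lower().startswith(expected)
    )
    return correct / len(benchmark)


def drift(old: Callable[[str], str], new: Callable[[str], str]) -> float:
    """Accuracy change from the old snapshot to the new one (negative = worse)."""
    return accuracy(new) - accuracy(old)


# Stand-in "snapshots" so the sketch runs offline; swap in real model calls.
def stub_march(question: str) -> str:
    return {"Is 17069 a prime number?": "No"}.get(question, "Yes")


def stub_june(question: str) -> str:
    return "No"


print(f"drift: {drift(stub_march, stub_june):+.2f}")
```

Keeping the benchmark fixed is the point of the design: any change in the score then reflects a change in the model rather than in the questions.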
Is ChatGPT Worse Than Its Competitors Now?
CryptoSlate conducted a small internal experiment with ChatGPT Plus (GPT-4), the OpenAI API (GPT-4), Anthropic (Claude 2), and Google (Bard), using the basic prompt from part of the research:
‘Is 17077 a prime number?’
The prompt was used on each model with additional reflection prompts as described below.
ChatGPT and OpenAI APIs
When given the prompt, both ChatGPT and the OpenAI API answered “no” and hallucinated the calculations. The image below details the conversation, with the model unable to identify 17077 as a prime number even after several rounds of reflection.
To be clear, 13 x 1313 is 17069.
The OpenAI GPT4 API did not reach this conclusion until it was specifically asked to calculate 13 x 1313, at which point it found that the answer is not 17077 as it had stated.
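The arithmetic is easy to confirm independently. A short check of the product, and of whether 13 divides 17077 at all:

```python
product = 13 * 1313
quotient, remainder = divmod(17077, 13)

print(product)              # 17069
print(17077 - product)      # 8
print(quotient, remainder)  # 1313 8
```

The nonzero remainder shows that 13 is not a factor of 17077.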
Claude 2 from Anthropic
However, Anthropic’s Claude 2 demonstrated its problem-solving process by performing calculations before providing the correct answer.
CryptoSlate then asked Claude 2 to perform the same task in a new chat window without showing its working. Claude 2 gave a solid answer, declining to commit while offering additional insight into the solution.
“Unfortunately, I cannot determine if 17077 is prime without showing some working. However, I can confirm that 17077 is not divisible by any prime number less than 121, which strongly suggests that it could be prime.”
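Claude 2's hedge is reasonable. Ruling out prime factors below 121 falls just short of a proof, because a composite number must have a prime factor no larger than its square root, and the square root of 17077 is a little over 130. The sketch below lists what remains to be checked; the helper name is illustrative.

```python
import math

n = 17077
limit = math.isqrt(n)  # largest integer whose square does not exceed n


def is_small_prime(k: int) -> bool:
    """Naive primality check, adequate for the tiny range used here."""
    return k >= 2 and all(k % d for d in range(2, math.isqrt(k) + 1))


# Primes from 121 up to the square-root limit that Claude 2's statement leaves open.
unchecked = [p for p in range(121, limit + 1) if is_small_prime(p)]

print(limit)      # 130
print(unchecked)  # [127]
print(n % 127)    # 59
```

Only 127 is left uncovered, and it does not divide 17077, which completes the argument that the number is prime.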
Google Bard
Google Bard approached the problem with a strategy similar to Claude 2's. However, instead of walking through the problem in text, it ran basic Python code. It also appears that Bard used information from a prime number website and Wikipedia in its solution. Interestingly, the cited page from the prime number site, primenumbers.info, only contained information about other prime numbers, not 17077.
Llama 2 from Meta
Interestingly, the open source 70 billion parameter Llama2 model recently released by Meta behaved similarly to GPT4 in CryptoSlate's limited testing.
Yet, when asked to reflect and show its working, Llama2 was able to determine that 17077 is a prime number, unlike currently available GPT4 versions.
However, the caveat is that Llama used an incomplete method to verify prime numbers. It did not consider other prime numbers up to the square root of 17077.
Therefore, technically, Llama successfully failed.
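For comparison, a complete trial-division check tests every candidate divisor up to the square root of the number. This is a minimal sketch of that standard method, not a reconstruction of what any of the tested models ran.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division by 2 and by odd numbers up to floor(sqrt(n))."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


print(is_prime(17077))  # True
print(is_prime(17069))  # False, since 17069 = 13 x 1313
```

Stopping at the square root is sufficient because any factor larger than it must pair with one smaller than it.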
GPT4-0613 Version of June 13, 2023
CryptoSlate also tested the math puzzle against the GPT4-0613 model (the June version) and received the same result. The model suggested in its first answer that 17077 is not a prime number. When asked to show its working, it eventually gave up. It concluded that the next reasonable number must be divisible by 17077 and said it was therefore not a prime number.
Thus, it seems that the task was not within GPT4's capabilities as of June 13. Older versions of GPT4 are not currently available to the public but were included in the research paper.
Code interpreter
Interestingly, ChatGPT, with the “Code Interpreter” function, responded correctly on its first try in CryptoSlate’s tests.
OpenAI Response and Model Impact
In response to claims that OpenAI’s models are degrading, The Economic Times reported that OpenAI’s vice president of product, Peter Welinder, denied the claims, saying that each new release is smarter than the last. He proposed that heavier use could lead to the perception of reduced effectiveness as more problems are noticed over time.
Interestingly, another study from Stanford researchers, published in JAMA Internal Medicine, found that the latest version of ChatGPT significantly outperformed medical students on difficult clinical reasoning exam questions.
The AI chatbot averaged more than 4 points higher than first- and second-year students on open-ended, case-based questions that require detailed analysis and composition of in-depth answers.
Thus, the apparent drop in performance of ChatGPT on specific tasks highlights the challenges of relying solely on large language models without continued rigorous testing. While the exact causes remain unclear, this underscores the need for continued monitoring and benchmarking as these AI systems rapidly evolve.
As advancements continue to improve the stability and consistency of these AI models, users should maintain a balanced perspective on ChatGPT, recognizing its strengths while remaining aware of its limitations.