How is ChatGPT's behavior changing over time?
July 18, 2023
Authors: Lingjiao Chen, Matei Zaharia, James Zou
cs.AI
Abstract
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code, and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%), but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) on this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.
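The kind of longitudinal monitoring the abstract argues for can be scripted with a small probe: re-query two dated snapshots of the same model with identical prompts and compare their accuracy. The sketch below illustrates this for the prime-identification task using the OpenAI Python client; the snapshot names (gpt-4-0314, gpt-4-0613), the tiny number list, and the sympy-based ground truth are illustrative assumptions and not the paper's actual evaluation code.

# Minimal drift-monitoring sketch: ask two dated snapshots of the same model
# the same yes/no prime questions and compare their accuracy over time.
# Assumes OPENAI_API_KEY is set and the listed snapshots are still served.
from openai import OpenAI
from sympy import isprime

client = OpenAI()

SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]      # assumed March/June 2023 snapshot names
NUMBERS = [10007, 10009, 10011, 10013, 10015]  # small illustrative test set

def ask_is_prime(model: str, n: int) -> str:
    """Ask one snapshot whether n is prime; return its answer lowercased."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Is {n} a prime number? Answer only yes or no."}],
    )
    return resp.choices[0].message.content.strip().lower()

def accuracy(model: str) -> float:
    """Fraction of questions on which the model's answer matches sympy's ground truth."""
    correct = 0
    for n in NUMBERS:
        truth = "yes" if isprime(n) else "no"
        correct += ask_is_prime(model, n).startswith(truth)
    return correct / len(NUMBERS)

if __name__ == "__main__":
    for model in SNAPSHOTS:
        print(f"{model}: {accuracy(model):.1%} on the prime-identification probe")

Running such a probe on a schedule and logging the per-snapshot scores is one simple way to surface the kind of month-to-month behavior shifts the paper reports.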