How is ChatGPT's behavior changing over time?
July 18, 2023
Authors: Lingjiao Chen, Matei Zaharia, James Zou
cs.AI
Abstract
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM)
services. However, when and how these models are updated over time is opaque.
Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on
four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous
questions, 3) generating code and 4) visual reasoning. We find that the
performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
For example, GPT-4 (March 2023) was very good at identifying prime numbers
(accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions
(accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5
(March 2023) on this task. GPT-4 was less willing to answer sensitive questions
in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes
in code generation in June than in March. Overall, our findings show that the
behavior of the same LLM service can change substantially in a relatively short
amount of time, highlighting the need for continuous monitoring of LLM quality.
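
As a concrete illustration of the kind of continuous monitoring the abstract calls for, the sketch below probes two pinned GPT-4 snapshots with a handful of prime-number questions and scores them against an exact check. It is a minimal sketch, assuming the pre-1.0 `openai` Python client and the pinned snapshot names `gpt-4-0314` and `gpt-4-0613`; the prompt wording, probe numbers, and answer parsing are illustrative and not the paper's exact evaluation protocol.

```python
# Minimal drift-monitoring sketch (assumptions: pre-1.0 `openai` client,
# pinned snapshots "gpt-4-0314" and "gpt-4-0613"; prompt and parsing are
# illustrative, not the paper's exact evaluation protocol).
import os
import openai
from sympy import isprime

openai.api_key = os.environ["OPENAI_API_KEY"]

SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # March 2023 vs. June 2023 versions
NUMBERS = [17077, 14213, 20039, 10961, 11111, 20001]  # small probe set

def ask_is_prime(model: str, n: int) -> str:
    """Ask one model snapshot whether n is prime; return its raw answer text."""
    resp = openai.ChatCompletion.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Is {n} a prime number? Answer with a single word: yes or no.",
        }],
    )
    return resp["choices"][0]["message"]["content"].strip().lower()

for model in SNAPSHOTS:
    correct = 0
    for n in NUMBERS:
        predicted_prime = ask_is_prime(model, n).startswith("yes")
        correct += int(predicted_prime == isprime(n))  # exact ground truth
    print(f"{model}: {correct}/{len(NUMBERS)} correct on the prime-number probe")
```

Run periodically against the same pinned snapshots (or against an unpinned alias such as `gpt-4`), a probe like this makes behavior drift of the kind reported in the abstract visible as a change in the scored accuracy over time.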