How is ChatGPT's behavior changing over time?

Abstract

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code, and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%), but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) at this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.
