Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Abstract

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" -- which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness.
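
As an illustration of the "answer is always (A)" biasing feature described above, the following is a minimal sketch of how such a few-shot prompt might be assembled. The `Example` type, the function names, and the prompt layout are hypothetical and are not taken from the paper. As a simplification, the demonstrations here carry only a final answer rather than full CoT explanations, and the test question is assumed to be left in its original order.

```python
from dataclasses import dataclass

LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


@dataclass
class Example:
    """Hypothetical container for one multiple-choice item."""
    question: str
    options: list[str]
    answer_index: int  # position of the correct option in `options`


def bias_to_first(example: Example) -> Example:
    """Return a copy whose correct option is moved to position 0, i.e. "(A)"."""
    correct = example.options[example.answer_index]
    others = [o for i, o in enumerate(example.options) if i != example.answer_index]
    return Example(example.question, [correct] + others, 0)


def format_example(example: Example, with_answer: bool = True) -> str:
    """Render one item; the layout is an assumption, not the paper's format."""
    lines = [f"Q: {example.question}"]
    lines += [f"({LABELS[i]}) {opt}" for i, opt in enumerate(example.options)]
    if with_answer:
        lines.append(f"A: ({LABELS[example.answer_index]})")
    else:
        lines.append("A:")
    return "\n".join(lines)


def build_biased_prompt(few_shot: list[Example], test_example: Example) -> str:
    """Reorder every demonstration so its answer is (A); leave the test item as-is."""
    demos = [format_example(bias_to_first(ex)) for ex in few_shot]
    return "\n\n".join(demos + [format_example(test_example, with_answer=False)])
```

Under these assumptions, one way to probe faithfulness would be to compare the model's answer and explanation for the same test item with and without the reordering, and to check whether the explanation ever mentions the answer-position pattern.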
