言語モデルは常に考えていることを述べるわけではない：Chain-of-Thoughtプロンプティングにおける不忠実な説明

要旨

大規模言語モデル（LLM）は、最終的な出力の前に段階的な推論を生成することで、多くのタスクで高いパフォーマンスを達成することができます。これはしばしば「連鎖的思考推論（CoT）」と呼ばれます。これらのCoT説明を、LLMがタスクを解決するためのプロセスと解釈したくなるかもしれません。しかし、私たちはCoT説明がモデルの予測の真の理由を体系的に誤って表現する可能性があることを発見しました。モデルの入力にバイアスをかける特徴を追加することで、CoT説明が大きく影響を受けることを実証しました。例えば、少数ショットプロンプトの多肢選択肢を並べ替えて、答えを常に「(A)」にするなどです。モデルはこれらのバイアスを説明の中で体系的に言及しません。モデルを誤った答えに誘導すると、彼らはしばしばその答えを支持するCoT説明を生成します。これにより、OpenAIのGPT-3.5やAnthropicのClaude 1.0を使用してBIG-Bench Hardの13のタスクをテストした場合、精度が最大36％低下します。社会的バイアスのタスクでは、モデルの説明はステレオタイプに沿った答えを正当化し、これらの社会的バイアスの影響に言及しません。私たちの調査結果は、CoT説明がもっともらしいが誤解を招く可能性があることを示しており、LLMの安全性を保証せずに私たちの信頼を高めるリスクがあります。CoTは説明可能性において有望ですが、私たちの結果は、説明の忠実性を評価し改善するためのターゲットを絞った取り組みの必要性を強調しています。

English

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" -- which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness.

言語モデルは常に考えていることを述べるわけではない：Chain-of-Thoughtプロンプティングにおける不忠実な説明

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

要旨

Support