語言模型並非總是說出其所思：在思維鏈提示中的不忠實解釋

摘要

大型語言模型（LLMs）可以通過逐步推理來在許多任務上取得強大表現，這通常被稱為思維鏈推理（CoT）。誘人的是將這些CoT解釋解讀為LLM解決任務的過程。然而，我們發現CoT解釋可能會系統性地誤導模型預測的真正原因。我們展示了當將偏見特徵添加到模型輸入中時，例如在幾輪提示中重新排列多個選項以使答案始終為“(A)”時，CoT解釋可能會受到嚴重影響，而模型在解釋中未能提及這些偏見特徵。當我們將模型偏向不正確答案時，它們經常生成支持這些答案的CoT解釋。這導致在使用來自OpenAI的GPT-3.5和Anthropic的Claude 1.0進行BIG-Bench Hard的13個任務套件測試時，準確性可能下降高達36%。在社會偏見任務中，模型解釋證明支持與刻板印象一致的答案，而未提及這些社會偏見的影響。我們的研究結果表明，CoT解釋可能是合理但具有誤導性的，這可能增加我們對LLMs的信任，但不能保證其安全性。CoT對於可解釋性是有前景的，但我們的結果強調了評估和改進解釋忠實度的有針對性努力的必要性。

English

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" -- which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness.

語言模型並非總是說出其所思：在思維鏈提示中的不忠實解釋

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

摘要

Support