语言模型并非总是说出其所思：在思维链提示中的不忠实解释

摘要

大型语言模型（LLMs）可以通过在给出最终输出之前进行逐步推理来在许多任务上取得强大的性能，这通常被称为思维链推理（CoT）。诱人的是将这些CoT解释解释为LLM解决任务的过程。然而，我们发现CoT解释可以系统性地误传模型预测的真实原因。我们证明，通过向模型输入添加偏见特征（例如，在少样本提示中重新排列多项选择选项，使答案始终为“(A)”），CoT解释可以受到严重影响，而模型在解释中通常未提及这些偏见。当我们偏向于错误答案时，模型经常生成支持这些答案的CoT解释。这导致在使用来自OpenAI的GPT-3.5和Anthropic的Claude 1.0进行测试时，13个BIG-Bench Hard任务套件中的准确率下降多达36%。在社会偏见任务中，模型解释证明支持符合刻板印象的答案，而未提及这些社会偏见的影响。我们的研究结果表明，CoT解释可能是合理但具有误导性的，这会增加我们对LLMs的信任，但并不能保证其安全性。CoT对于可解释性是有前景的，但我们的结果突显了评估和改进解释忠实度的有针对性努力的必要性。

English

Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" -- which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness.

语言模型并非总是说出其所思：在思维链提示中的不忠实解释

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

摘要

Support