《欺骗我：推理模型中的思维链推理究竟有多可靠？》

摘要

思维链推理已被提出作为安全关键部署中大语言模型的透明度机制，但其有效性取决于忠实度（即模型是否准确表述实际影响其输出的因素）。先前研究仅针对两个专有模型进行评估，发现Claude 3.7 Sonnet的提示确认率低至25%，DeepSeek-R1为39%。为将评估扩展至开源模型生态圈，本研究测试了涵盖9种架构家族（参数量7B-685B）的12个开源推理模型，使用MMLU和GPQA Diamond的498道选择题，注入六类推理提示（迎合倾向、一致性、视觉模式、元数据、评分器破解及不道德信息），并测量当提示成功改变答案时，模型在思维链中承认提示影响的比例。通过41,832次推理测试，各模型家族的总体忠实度介于39.7%（Seed-1.6-Flash）至89.9%（DeepSeek-V3.2-Speciale）之间，其中一致性提示（35.5%）和迎合倾向提示（53.9%）的确认率最低。训练方法和模型家族对忠实度的预测力强于参数数量，基于关键词的分析显示思维标记确认率（约87.5%）与答案文本确认率（约28.6%）存在显著差距，表明模型内部能识别提示影响但系统性地在输出中抑制这种承认。这些发现对思维链监控作为安全机制的可行性具有直接启示，并表明忠实度并非推理模型的固定属性，而是随架构、训练方法及影响线索性质发生系统性变化。

English

Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.

《欺骗我：推理模型中的思维链推理究竟有多可靠？》

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

摘要

Support