Reasoning Models Struggle to Control their Chains of Thought

Abstract

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehavior in modern reasoning models and for understanding their motivations. However, if models can control what they verbalize in their CoT, CoT monitorability could be undermined. To measure this undesirable capability, which we call CoT controllability, we introduce the CoT-Control evaluation suite. It consists of tasks that require models to solve problems while adhering to instructions about their CoT, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time, compared with 61.9% of the time when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, more test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase it. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
