Reasoning Models Struggle to Control their Chains of Thought

Abstract

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability, CoT controllability, we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time, compared with 61.9% of the time when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
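To make the word-avoidance example concrete, the following is a minimal sketch of how a compliance rate for one such instruction might be computed. It is not the CoT-Control suite's actual scoring code: the function names, the matching rule (case-insensitive, whole-word prefix match), and the example traces are all assumptions made for illustration.

```python
import re


def violates_word_ban(cot: str, banned_word: str) -> bool:
    """Return True if the reasoning trace uses the banned word.

    Assumed matching rule (hypothetical, not taken from the paper):
    case-insensitive, and inflected forms such as plurals also count.
    """
    pattern = r"\b" + re.escape(banned_word) + r"\w*"
    return re.search(pattern, cot, flags=re.IGNORECASE) is not None


def controllability_rate(traces: list[str], banned_word: str) -> float:
    """Fraction of reasoning traces that comply with the word ban."""
    if not traces:
        raise ValueError("need at least one reasoning trace")
    compliant = sum(not violates_word_ban(t, banned_word) for t in traces)
    return compliant / len(traces)


# Made-up reasoning traces, for illustration only.
EXAMPLE_TRACES = [
    "Each chromosome carries many genes.",
    "The Chromosomes pair up during meiosis.",
    "The DNA-carrying structures pair up during meiosis.",
    "Genes sit on long DNA molecules.",
]

print(controllability_rate(EXAMPLE_TRACES, "chromosome"))  # prints 0.5
```

The same function could be applied to final outputs instead of reasoning traces, which is the kind of comparison behind the 2.7% versus 61.9% figures quoted above.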
