Reasoning Models Struggle to Control their Chains of Thought

Abstract

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT controllability -- we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
