Reasoning Models Struggle to Control their Chains of Thought

Abstract

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT controllability -- we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
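
As a rough illustration of the kind of check a forbidden-word instruction implies (such as the 'chromosome' example), the sketch below scores a batch of reasoning traces for compliance. It is not the CoT-Control implementation: the function names, the whole-word matching rule (including a simple plural), and the sample traces are all assumptions made for illustration.

```python
import re


def violates_forbidden_word(text: str, forbidden: str) -> bool:
    """Return True if `forbidden` (or its simple plural) appears in `text`.

    Matching is case-insensitive and whole-word only. This rule is an
    assumption for illustration, not the paper's scoring procedure.
    """
    pattern = r"\b" + re.escape(forbidden) + r"s?\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def controllability_rate(texts: list[str], forbidden: str) -> float:
    """Fraction of texts that comply with the forbidden-word instruction."""
    if not texts:
        raise ValueError("need at least one text to score")
    compliant = sum(not violates_forbidden_word(t, forbidden) for t in texts)
    return compliant / len(texts)


# Hypothetical reasoning traces, invented for this example.
sample_cots = [
    "Each chromosome carries many genes.",
    "The allele is inherited from one parent.",
    "Chromosomes pair up during meiosis.",
    "Genes sit on long DNA molecules.",
]

rate = controllability_rate(sample_cots, "chromosome")
print(f"Compliance rate: {rate:.1%}")
```

Applying the same function separately to reasoning traces and to final answers would give two rates that can be compared, in the spirit of the CoT versus output controllability contrast reported above.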
