The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Abstract

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with an analysis of internal answer-token probabilities, we find CoT unfaithfulness consistently across languages and hint types, at an average rate of 95.9\% for models ranging from 8B to 120B parameters. Frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, which makes deception difficult for external monitors to detect. Moreover, frontier models often commit to the misaligned cue in their latent activations within the first 15\% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain at 100\% in low-resource languages, exposing fundamental limitations of current CoT-based oversight. Our results show that CoT monitoring is fundamentally fragile under linguistic distribution shift and provides a substantially weaker safety signal than English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at \url{https://multilingual-cot-monitoring.github.io/}.