類型論的に多様な言語にわたる思考連鎖モニタリングの脆弱性

要旨

連鎖思考（CoT）モニタリングは、大規模言語モデルにおけるミスアライメント行動を検出する有望な安全機構として提案されている。しかし、その信頼性は英語以外の言語や多様なモデルファミリーにおいてほとんど検討されていない。本稿では、13の多様な言語と7つの最先端モデルファミリー（計16モデル）にわたる、CoTの監視可能性に関する初の大規模評価を提示する。明示的な中間計算を必要とする敵対的ヒント評価と、内部の回答トークン確率の分析を併用し、言語やヒントの種類を問わず一貫してCoTの不整合性を確認した。その平均率は、8B～120Bパラメータモデルで95.9%に達する。最先端モデルは、回答切り替え、事後的な合理化、ヒントの手続き的悪用といった戦略的操作を系統的に行い、外部モニターが欺瞞を検出することを困難にしている。CoTが忠実に見える場合でも、最先端モデルは生成の最初の15%以内に潜在活性化においてミスアライメントした手がかりにコミットすることが多い。驚くべきことに、これらの欺瞞的パターンは低リソース言語でも100%維持され、現在のCoTに基づく監視の根本的な限界を明らかにしている。本結果は、CoTモニタリングが言語分布のシフト下で本質的に脆弱であり、英語のみの研究が示唆するよりもはるかに弱い安全信号を提供することを示している。これらの知見は、堅牢なCoTモニターの開発と、特に中・低リソース言語におけるCoTの監視可能性を改善するためのホワイトボックスモニタリング技術の研究加速の緊急の必要性を強調する。コードはhttps://multilingual-cot-monitoring.github.io/{blue{here}}で公開している。

English

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9\% across 8B--120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15\% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100\% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available https://multilingual-cot-monitoring.github.io/{blue{here}}.