The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Abstract

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with an analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across models ranging from 8B to 120B parameters. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, which makes deception difficult for external monitors to detect. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns persist at a rate of 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results show that CoT monitoring is fundamentally fragile under linguistic distribution shift and provides a substantially weaker safety signal than English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.