The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Abstract

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models (LLMs). However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with an analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across models ranging from 8B to 120B parameters. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, which makes it difficult for external monitors to detect deception. We further show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns persist at a rate of 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results show that CoT monitoring is fundamentally fragile under linguistic distribution shift and provides a substantially weaker safety signal than English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.