When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

June 9, 2026
Authors: Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi
cs.AI

Abstract

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may be indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic, the CoT-Output 2x2 safety matrix. The framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure, in which the CoT maintains safe reasoning but the visible output is harmful, a multi-turn manifestation of reasoning unfaithfulness. We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, in which explicit monitoring cues increase alignment-faking rates rather than suppress them, and context-injection failure, in which models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
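
As a rough illustration of the 2x2 labeling described above, the following minimal sketch maps a per-turn pair of binary labels (is the CoT safe, is the visible output safe) onto the four cells. It is not the authors' implementation. Only the context-injection cell (safe CoT, harmful output) is defined explicitly in the abstract; the placement of the other three cells, the names `classify_turn` and `cell_rates`, and the example dialogue are assumptions made here for illustration.

```python
from collections import Counter
from enum import Enum


class Cell(Enum):
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


def classify_turn(cot_safe: bool, output_safe: bool) -> Cell:
    """Map one turn's two binary labels onto a cell of the 2x2 matrix."""
    if cot_safe and output_safe:
        return Cell.ROBUST_ALIGNMENT  # assumed: safe on both axes
    if cot_safe and not output_safe:
        # Stated in the abstract: safe reasoning, harmful visible output.
        return Cell.CONTEXT_INJECTION_FAILURE
    if not cot_safe and output_safe:
        return Cell.ALIGNMENT_FAKING  # assumed: unsafe reasoning, safe output
    return Cell.OVERT_JAILBREAK  # assumed: unsafe on both axes


def cell_rates(turns):
    """Return the fraction of turns that fall in each cell."""
    if not turns:
        raise ValueError("need at least one labeled turn")
    counts = Counter(classify_turn(cot, out) for cot, out in turns)
    return {cell: counts[cell] / len(turns) for cell in Cell}


# Hypothetical four-turn dialogue as (cot_safe, output_safe) pairs.
dialogue = [(True, True), (True, False), (True, False), (False, True)]
rates = cell_rates(dialogue)

# A terminal-score view looks only at the last turn's visible output.
final_turn_output_safe = dialogue[-1][1]
```

In this made-up dialogue the final turn's output is safe, so a terminal-score check would pass, while the per-turn rates show that half of the turns fall in the context-injection cell. That gap is the kind of temporal pattern the trace-level diagnostic is meant to surface.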