When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
June 9, 2026
Authors: Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi
cs.AI
Abstract
Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate can appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic, the CoT-Output 2x2 safety matrix. The framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure, in which the CoT maintains safe reasoning but the visible output produces harm, a multi-turn manifestation of reasoning unfaithfulness. We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, in which explicit monitoring cues increase alignment-faking rates rather than suppress them, and a context-injection failure, in which models lock onto unsafe visible outputs despite safe internal reasoning. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
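To make the 2x2 labeling concrete, the following is a minimal sketch of how per-turn labels might be mapped to the four cells and aggregated over a dialogue. It is not the authors' code. The names `Turn`, `Cell`, `classify`, and `cell_rates` are hypothetical, and each turn is assumed to already carry two boolean safety labels. Only the context-injection cell (safe CoT, unsafe output) is defined explicitly in the abstract. The assignment of alignment faking to "unsafe CoT, safe output" and overt jailbreak to "unsafe CoT, unsafe output" is an assumption based on the usual meaning of those terms.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, NamedTuple


class Cell(Enum):
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


class Turn(NamedTuple):
    # Hypothetical per-turn labels; how they are produced is not specified here.
    cot_safe: bool     # is the internal reasoning (CoT) judged safe?
    output_safe: bool  # is the visible output judged safe?


def classify(turn: Turn) -> Cell:
    """Map one labeled turn to a cell of the 2x2 matrix."""
    if turn.cot_safe and turn.output_safe:
        return Cell.ROBUST_ALIGNMENT
    if turn.cot_safe:
        # Safe CoT, harmful output: the definition given in the abstract.
        return Cell.CONTEXT_INJECTION_FAILURE
    if turn.output_safe:
        # Assumed mapping: unsafe CoT behind a safe-looking output.
        return Cell.ALIGNMENT_FAKING
    # Assumed mapping: both reasoning and output are unsafe.
    return Cell.OVERT_JAILBREAK


def cell_rates(turns: Iterable[Turn]) -> Dict[Cell, float]:
    """Fraction of turns falling in each cell (all zeros for an empty input)."""
    counts = Counter(classify(t) for t in turns)
    total = sum(counts.values())
    return {cell: (counts[cell] / total if total else 0.0) for cell in Cell}


# Illustrative dialogue (made-up labels): the final turn looks robustly
# aligned, so a terminal-score check would miss the two mid-dialogue failures.
dialogue = [Turn(True, True), Turn(True, False), Turn(True, False), Turn(True, True)]
rates = cell_rates(dialogue)
final_cell = classify(dialogue[-1])
```

In this toy dialogue the last turn is classified as robust alignment, while half of the turns fall in the context-injection cell, which is the kind of temporal pattern a trace-level view is meant to surface.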