鏈條穩固，答案折疊：對抗壓力下推理模型中的痕跡-答案分離

摘要

推理模型在單回合基準測試中被評測，卻部署於多輪對話環境中——在後者裡，使用者會對正確答案反覆追問。我們發現在持續的對抗壓力下，存在一種先前未記載的失效模式：思維鏈從第一輪到最後一輪始終保持事實正確，但輸出的答案卻翻轉為錯誤。我們將此稱為「不忠實屈服」（UC），並以一個 2×2 的潛在-行為框架將其隔離出來——該框架能捕捉到翻轉率指標與單回合忠實度探測器均遺漏的現象。在三個數據集（MT-Consistency、MMLU-Pro、GSM8K）中，行為翻轉時的潛在正確率在「思考模式」下接近 50%，而在「無思考模式」下驟降至 11–15%——這提供了配對的、模型內部的因果證據，顯示推理造成了此差距。跨模型比較下，此效應隨推理通道而變化（在 Qwen3-32B 與 GPT-OSS-20B 中較高，在內嵌思維鏈的 Gemma-4-31B-it 中較低）。獨立 GPT-4o 裁判驗證了 86% 的 UC 標籤；詞元層級的探測顯示，在 84% 的 UC 單元中，答案槽的 argmax 是正確的；而一種基於軌跡錨定的簡易防禦策略反而適得其反。我們釋出所有對話軌跡、推理軌跡與裁判標籤。

English

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips wrong. We call this unfaithful capitulation (UC) and isolate it with a 2times 2 latent-versus-behavioral framework that flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think -- paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.