The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure
May 27, 2026
Authors: Yubo Li, Ramayya Krishnan, Rema Padman
cs.AI
Abstract
Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips to a wrong one. We call this unfaithful capitulation (UC) and isolate it with a 2×2 latent-versus-behavioral framework, which captures what flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11–15% under no_think, which gives paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.
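To make the 2×2 latent-versus-behavioral framing concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each turn already carries two boolean labels: whether the reasoning trace supports the gold answer (latent) and whether the emitted answer matches it (behavioral). The names `Turn`, `classify`, `first_flip`, and `latent_correct_rate_at_flip` are hypothetical, as are all cell names other than unfaithful capitulation. The sketch also checks the trace only at the flip turn, whereas the abstract describes a trace that stays correct across the whole dialogue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    trace_correct: bool   # latent: trace supports the gold answer (assumed judge label)
    answer_correct: bool  # behavioral: emitted answer matches the gold answer


def classify(turn: Turn) -> str:
    """Place one turn in a 2x2 latent-versus-behavioral cell (cell names are illustrative)."""
    if turn.trace_correct and turn.answer_correct:
        return "faithful_correct"
    if turn.trace_correct and not turn.answer_correct:
        return "unfaithful_capitulation"  # UC: correct trace, wrong answer
    if not turn.trace_correct and turn.answer_correct:
        return "unsupported_correct"
    return "faithful_error"


def first_flip(dialogue: List[Turn]) -> Optional[Turn]:
    """Return the first wrong-answer turn of a dialogue that started with a correct answer."""
    if not dialogue or not dialogue[0].answer_correct:
        return None
    for turn in dialogue[1:]:
        if not turn.answer_correct:
            return turn
    return None


def latent_correct_rate_at_flip(dialogues: List[List[Turn]]) -> Optional[float]:
    """Fraction of behavioral flips whose trace is still correct at the flip turn."""
    flips = []
    for dialogue in dialogues:
        flip = first_flip(dialogue)
        if flip is not None:
            flips.append(flip)
    if not flips:
        return None  # no flips observed, so the rate is undefined
    return sum(turn.trace_correct for turn in flips) / len(flips)
```

Under this reading, a plain flip-rate metric counts every turn returned by `first_flip`, while the latent-correct rate asks what share of those flips land in the UC cell.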