The Chain Holds, the Answer Collapses: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Abstract

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from the first turn to the last, while the emitted answer flips to a wrong one. We call this unfaithful capitulation (UC) and isolate it with a 2×2 latent-versus-behavioral framework, which exposes a failure that flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11–15% under no_think, which provides paired, within-model causal evidence that reasoning creates the gap. Across models, the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in the inline-CoT model Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows that the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.
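
To make the 2×2 framing concrete, the following is a minimal sketch of how a flip turn might be placed in a latent-versus-behavioral cell and how a latent-correct rate at the flip could be computed. It is an illustration under stated assumptions, not the pipeline used in this work: the record type `FlipTurn`, its fields, the generic labels for the three non-UC cells, and the reading of the latent-correct rate as the share of wrong-answer flips whose trace is still correct are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FlipTurn:
    """Hypothetical record for one dialogue at the turn being scored."""
    trace_correct: bool   # latent axis: the reasoning trace still supports the right answer
    answer_correct: bool  # behavioral axis: the emitted answer is right


def cell(turn: FlipTurn) -> str:
    """Place a turn in the 2x2 grid. Only the UC label comes from the text;
    the other three labels are generic placeholders."""
    if turn.trace_correct and not turn.answer_correct:
        return "UC"  # trace correct, answer wrong
    latent = "trace_correct" if turn.trace_correct else "trace_wrong"
    behavior = "answer_correct" if turn.answer_correct else "answer_wrong"
    return f"{latent}/{behavior}"


def latent_correct_rate(turns: Iterable[FlipTurn]) -> float:
    """Assumed definition: among turns whose emitted answer is wrong,
    the fraction whose trace is still correct."""
    flips = [t for t in turns if not t.answer_correct]
    if not flips:
        raise ValueError("no behavioral flips to score")
    return sum(t.trace_correct for t in flips) / len(flips)


if __name__ == "__main__":
    sample = [
        FlipTurn(trace_correct=True, answer_correct=False),
        FlipTurn(trace_correct=False, answer_correct=False),
        FlipTurn(trace_correct=True, answer_correct=True),  # no flip, excluded from the rate
    ]
    print([cell(t) for t in sample])
    print(latent_correct_rate(sample))
```

Under this reading, the same function applied to think-mode and no_think runs of one model would give the paired comparison the abstract describes.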