The Enigma of Artificial Reasoning: Investigating the Production-Evaluation Gap in Large Reasoning Models

Abstract

Studies of human reasoning have shown that people are typically better at evaluating reasoning than at producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How, then, do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems paired with solutions that contain trivial reasoning flaws but reach valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Humans, we find, are only 6% worse at grading such problems than at solving them. LRMs, by contrast, show a substantial production-evaluation gap: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce the correct answer and then check for it instead of carefully verifying each step, and they fabricate rationalizations even when they notice anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for the models' confirmation bias. These findings indicate a significant limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.
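
To make the linear-probe claim concrete, here is a minimal sketch of what such a probe computes, assuming plain logistic regression on a single vector of activations per solution. Everything in it is hypothetical: the names `train_linear_probe`, `probe_accuracy`, and `fake_activations`, the dimensionality, and above all the synthetic activations, which are constructed by hand so that VAIR-like solutions sit on the "valid" side of the probe direction. It shows what a failure to represent VAIR solutions as invalid would look like; it is not the paper's data, code, or results.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(valid)
        grad = p - y                             # gradient of log loss w.r.t. logits
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose thresholded probe output matches the label."""
    preds = ((X @ w + b) > 0).astype(int)
    return float((preds == y).mean())


# Synthetic stand-in for model activations (assumption: one vector per solution).
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)


def fake_activations(n, shift):
    """Gaussian noise shifted along one hypothetical 'validity' direction."""
    return rng.normal(size=(n, d)) + shift * direction


# Probe training set: valid solutions (label 1) vs. ordinary invalid ones (label 0).
X_train = np.vstack([fake_activations(200, +3.0), fake_activations(200, -3.0)])
y_train = np.concatenate([np.ones(200), np.zeros(200)]).astype(int)
w, b = train_linear_probe(X_train, y_train)
train_acc = probe_accuracy(X_train, y_train, w, b)

# VAIR-like set: the true label is invalid (0), but by construction the
# activations resemble those of valid solutions.
X_vair = fake_activations(100, +2.0)
y_vair = np.zeros(100, dtype=int)
vair_acc = probe_accuracy(X_vair, y_vair, w, b)

print(f"probe accuracy on ordinary solutions: {train_acc:.2f}")
print(f"probe accuracy on VAIR-like solutions: {vair_acc:.2f}")
```

In this toy setup the probe separates ordinary valid and invalid solutions well yet labels most VAIR-like solutions as valid, which mirrors the qualitative pattern described above. A real analysis would replace `fake_activations` with activations extracted from an LRM and would evaluate the probe on held-out data.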