The Enigma of Artificial Reasoning: Investigating the Production-Evaluation Gap in Large Reasoning Models

Abstract

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than at producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How, then, do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems paired with solutions that contain trivial reasoning flaws but reach valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. We find that humans are only 6% worse at grading such problems than at solving them. LRMs, by contrast, show a substantial production-evaluation gap: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce a solution and then check for the correct answer instead of carefully verifying each step, and they fabricate rationalizations even when they notice anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for the models' confirmation bias. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.