The Enigma of Artificial Reasoning: Investigating the Production-Evaluation Gap in Large Reasoning Models

Abstract

Studies of human reasoning have shown that people are typically better at evaluating reasoning than at producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How, then, do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems paired with solutions that contain trivial reasoning flaws but reach valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Whereas humans are only 6% worse at grading such problems than at solving them, LRMs show a substantial production-evaluation gap: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: instead of carefully verifying each step, LRMs often produce the correct answer themselves and then check for it, and they fabricate rationalizations even when they notice anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for the models' confirmation bias. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.