The Enigma of Artificial Reasoning: Investigating the Production-Evaluation Gap in Large Reasoning Models

Abstract

Studies of human reasoning have shown that people are typically better at evaluating reasoning than at producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How, then, do LRMs perform at evaluating reasoning? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems paired with solutions that contain trivial reasoning flaws but reach valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Humans, we find, are only 6% worse at grading such problems than at solving them. LRMs, by contrast, show a substantial production-evaluation gap: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. What explains this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce the correct answer and then check for it instead of carefully verifying each step, fabricating rationalizations even when they notice anomalous reasoning. Linear probes corroborate this: while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for the models' confirmation bias. These findings point to an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasoning.
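
For readers unfamiliar with linear probing, the sketch below illustrates the general technique only; it is not the authors' implementation. It fits a logistic-regression probe to synthetic vectors standing in for model activations and measures how well a single linear direction separates two labels (for example, valid versus invalid reasoning). The function names, the synthetic data generator, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np


def make_synthetic_activations(n, d, seed):
    """Hypothetical stand-in for LRM activations: Gaussian noise plus a
    label-dependent shift along the first coordinate."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = np.zeros(d)
    direction[0] = 1.0  # assumed "validity" direction, for illustration only
    acts = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, direction)
    return acts, labels.astype(float)


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels              # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ err) / n
        b -= lr * err.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    preds = (acts @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


if __name__ == "__main__":
    train_x, train_y = make_synthetic_activations(400, 16, seed=0)
    test_x, test_y = make_synthetic_activations(200, 16, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out probe accuracy:", probe_accuracy(w, b, test_x, test_y))
```

In a real analysis the synthetic vectors would be replaced by activations extracted from a model layer. A probe that separates valid from invalid solutions on ordinary data but fails on VAIR-style solutions would be consistent with the finding summarized above.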