Decoding the Critique Mechanism in Large Reasoning Models

Abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that let them revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has a sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes into their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: although the error propagates through the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism that helps the model detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector that represents this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances test-time scaling performance at no extra training cost. Our findings clarify how critique behavior arises in LRMs and suggest a promising direction for controlling and improving their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.
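As a rough illustration of what steering a latent representation with a vector involves, the sketch below builds a direction from two groups of toy activations and adds a scaled copy of it to a hidden state. This is only a sketch under stated assumptions: the abstract does not specify how the critique vector is extracted or applied, so the difference-of-means construction, the function names, the scale `alpha`, and all numbers here are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(
    with_behavior: Sequence[Sequence[float]],
    without_behavior: Sequence[Sequence[float]],
) -> Vector:
    """Assumed construction: mean activation when the behavior is present
    minus mean activation when it is absent."""
    mu_pos = mean_vector(with_behavior)
    mu_neg = mean_vector(without_behavior)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state along the direction by a factor alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (made-up numbers, not from the paper).
detected = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # runs where the error was caught
missed = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]    # runs where it was not

critique_direction = difference_of_means(detected, missed)
steered = steer([0.5, 0.5, 0.5], critique_direction, alpha=2.0)

print(critique_direction)  # [2.0, -1.0, 0.0]
print(steered)             # [4.5, -1.5, 0.5]
```

In a real model the same addition would be applied to the activations of a chosen layer during the forward pass. Which layer, which token positions, and what value of `alpha` to use are left open here.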