Decoding the Critique Mechanism in Large Reasoning Models

Abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has a sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes into their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: even when the error propagates through the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism that helps the model detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature-space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings deepen the understanding of LRMs' critique behavior and suggest a promising direction for controlling and improving their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.
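To make the idea of steering latent representations with a vector concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the critique vector is derived, so the difference-of-means construction here is an assumption (a common choice in activation-steering work). The function names, the scaling factor `alpha`, and the toy three-dimensional "hidden states" are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means(with_error: List[Vector], without_error: List[Vector]) -> Vector:
    """Assumed construction of a direction: mean(with_error) - mean(without_error)."""
    mu_err = mean_vector(with_error)
    mu_ok = mean_vector(without_error)
    return [a - b for a, b in zip(mu_err, mu_ok)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift one hidden state along the direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy "hidden states" with made-up numbers; real activations would come
# from a chosen layer of the model on reasoning traces with and without
# an inserted arithmetic error.
states_with_error = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
states_without_error = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

critique_direction = difference_of_means(states_with_error, states_without_error)
steered = steer([0.5, 0.5, 0.5], critique_direction, alpha=2.0)

print(critique_direction)
print(steered)
```

In this toy setup only the first coordinate differs between the two groups, so the steered state moves along that coordinate alone. In practice the layer to intervene on and the value of `alpha` would have to be chosen empirically.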