解碼大型推理模型中的評審機制

摘要

大型推理模型（LRMs）具備回溯與自我驗證機制，使其能修正中間步驟並得出正確解答，從而在複雜邏輯基準測試中展現優異表現。我們假設這類行為僅在模型擁有足夠強大的「批判」能力來偵測自身錯誤時才有助益。本研究系統性地探討當前LRMs如何透過在推理中間步驟中插入算術錯誤來從失誤中恢復。值得注意的是，我們發現一個特殊且重要的現象：儘管錯誤在整個思維鏈（CoT）中持續傳播而未出現任何口頭修正，模型在思考過程結束後仍能得出正確的最終答案。此恢復現象暗示存在一種內部機制幫助模型偵測錯誤並觸發自我修正，我們稱之為隱藏批判能力。基於特徵空間分析，我們辨識出一個高度可解釋的批判向量來表徵此行為。跨越多個模型規模與系列的廣泛實驗證明，使用此向量操縱潛在表徵可提升模型的錯誤偵測能力，並在無額外訓練成本下強化測試時擴展的性能。我們的研究成果提供對LRMs批判行為的寶貴理解，為控制與改進其自我驗證機制指出一條前景可期的方向。我們的程式碼開源於：https://github.com/mail-research/lrm-critique-vectors。

English

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong ``critique'' ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.