Decoding the Critique Mechanism in Large Reasoning Models

Abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has a sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes into their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: even when the error propagates through the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism that helps the model detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector that represents this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior and suggest a promising direction for controlling and improving their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.
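The steering idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the critique vector is extracted or applied, so the sketch assumes a common recipe: a difference-of-means direction between activations collected on reasoning traces with and without an inserted error, added to a hidden state with a scaling factor. The function names, the toy activations, and the `alpha` parameter are all hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(with_error: Sequence[Sequence[float]],
                        without_error: Sequence[Sequence[float]]) -> Vector:
    """Assumed extraction recipe: mean(error activations) - mean(clean activations)."""
    mu_err = mean_vector(with_error)
    mu_clean = mean_vector(without_error)
    return [e - c for e, c in zip(mu_err, mu_clean)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Assumed steering rule: shift a hidden state along the direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations" (illustrative values, not real model data).
acts_error = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
acts_clean = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

critique_direction = difference_of_means(acts_error, acts_clean)
steered = steer([0.5, 0.5, 0.5], critique_direction, alpha=2.0)
```

In a real model the same shift would be applied to the residual-stream activations of a chosen layer at inference time, which is why no additional training is involved. The choice of layer and scale is left open here.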