

Rectifying LLM Thought from Lens of Optimization

December 1, 2025
Authors: Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen
cs.AI

Abstract

Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
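To make the aggregation concrete, the sketch below is a minimal, hypothetical Python illustration of a RePro-style composite reward, not the paper's actual formulas. It treats per-step progress estimates as the "updates" in the gradient-descent analogy, scores their intensity (average magnitude of progress) and stability (consistency across steps), and blends the resulting process-level term with a verifiable outcome reward as one might inside an RLVR loop. All function names, weighting constants, and the progress estimator itself are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def process_level_reward(step_progress: List[float],
                         intensity_weight: float = 0.5,
                         stability_weight: float = 0.5) -> float:
    """Hypothetical composite process-level reward over a chain of thought.

    `step_progress` holds a per-step estimate of how much each reasoning
    step moves the model toward the solution (the "update" in the
    gradient-descent analogy). Only the aggregation is sketched here;
    the actual surrogate objective in RePro is defined in the paper.
    """
    if not step_progress:
        return 0.0
    # Intensity: average magnitude of per-step progress (purposeful steps).
    intensity = mean(abs(p) for p in step_progress)
    # Stability: penalize erratic, oscillating progress (e.g. overthinking loops).
    stability = 1.0 / (1.0 + pstdev(step_progress))
    return intensity_weight * intensity + stability_weight * stability


def rlvr_reward(step_progress: List[float], answer_correct: bool,
                process_coef: float = 0.2) -> float:
    """Blend the verifiable outcome reward with the process-level term."""
    outcome = 1.0 if answer_correct else 0.0
    return outcome + process_coef * process_level_reward(step_progress)


# Example: a steadily progressing chain vs. an oscillating, overlong one.
steady = [0.3, 0.25, 0.3, 0.35]
erratic = [0.6, -0.4, 0.5, -0.3, 0.4, -0.2]
print(rlvr_reward(steady, answer_correct=True))
print(rlvr_reward(erratic, answer_correct=True))
```

Under these assumptions, the steady chain receives a higher composite reward than the erratic one even when both reach a correct answer, which is the intended effect of rewarding the optimization process rather than the outcome alone.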