Rectifying LLM Thought from Lens of Optimization
December 1, 2025
Authors: Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen
cs.AI
Abstract
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
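To make the described mechanism concrete, the sketch below illustrates one way a composite process-level reward of the kind the abstract outlines could be computed and combined with a verifiable outcome reward in an RLVR pipeline. Everything here is an assumption for illustration: the paper does not specify its surrogate objective, scoring functions, or aggregation, and the names (`surrogate_objective`, `intensity_score`, `stability_score`, `composite_reward`) are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of a RePro-style composite reward under the
# "CoT as gradient descent" view; all function names and formulas
# are illustrative assumptions, not the paper's actual method.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class CoTTrace:
    step_embeddings: List[np.ndarray]  # one vector per reasoning step
    final_answer_correct: bool         # verifiable outcome signal


def surrogate_objective(step: np.ndarray, target: np.ndarray) -> float:
    # Treat each reasoning step as an iterate moving toward a "solution"
    # point; use squared distance to that point as a stand-in objective.
    return float(np.sum((step - target) ** 2))


def intensity_score(trace: CoTTrace, target: np.ndarray) -> float:
    # Intensity: average decrease of the surrogate objective per step
    # (a larger average decrease indicates stronger "descent").
    losses = [surrogate_objective(s, target) for s in trace.step_embeddings]
    if len(losses) < 2:
        return 0.0
    deltas = np.diff(losses)           # negative values mean progress
    return float(np.mean(-deltas))


def stability_score(trace: CoTTrace, target: np.ndarray) -> float:
    # Stability: fraction of steps that actually reduce the objective,
    # penalizing oscillation and backtracking.
    losses = [surrogate_objective(s, target) for s in trace.step_embeddings]
    if len(losses) < 2:
        return 1.0
    deltas = np.diff(losses)
    return float(np.mean(deltas < 0))


def composite_reward(trace: CoTTrace, target: np.ndarray,
                     alpha: float = 0.5, beta: float = 0.5,
                     outcome_weight: float = 1.0) -> float:
    # Aggregate the two process-level scores and add the verifiable
    # outcome reward, mirroring how a process reward could be plugged
    # into an RLVR training loop alongside answer verification.
    process = (alpha * np.tanh(intensity_score(trace, target))
               + beta * stability_score(trace, target))
    outcome = outcome_weight * float(trace.final_answer_correct)
    return process + outcome
```

In an actual RLVR setup, the target point and step representations would have to come from the model or an auxiliary scorer rather than be given, and the aggregation weights would be tuned; this sketch only shows the shape of combining a process-level signal with a verifiable outcome reward.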