LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
July 21, 2025
Authors: Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
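
To make the two-stage recipe concrete, below is a minimal Python sketch of the idea as described in the abstract: stage one estimates a per-problem length budget from successful rollouts, and stage two injects that budget into the reasoning context and rewards staying near it. All names here (`rollout_fn`, the dict keys, `alpha`, the exact reward shape) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of LAPO's two-stage idea.

def stage1_length_stats(problems, rollout_fn, k=8):
    """Stage 1: sample k reasoning traces per problem and record the token
    lengths of the successful ones, approximating the 'statistical
    distribution of successful solution lengths' the abstract describes."""
    stats = {}
    for prob in problems:
        ok_lengths = []
        for _ in range(k):
            trace_tokens, is_correct = rollout_fn(prob)  # hypothetical sampler
            if is_correct:
                ok_lengths.append(len(trace_tokens))
        if ok_lengths:
            ok_lengths.sort()
            stats[prob["id"]] = ok_lengths[len(ok_lengths) // 2]  # median budget
    return stats


def stage2_prompt(problem, stats):
    """Stage 2: embed the discovered length budget directly in the reasoning
    context so it acts as meta-cognitive guidance during training and inference."""
    budget = stats.get(problem["id"])
    hint = f"Try to reason within about {budget} tokens.\n" if budget else ""
    return hint + problem["question"]


def length_adaptive_reward(is_correct, used_tokens, budget, alpha=0.5):
    """Toy reward shape (not the paper's exact formula): full credit for a
    correct answer, with a penalty that grows as the trace overshoots the
    learned budget."""
    if not is_correct:
        return 0.0
    overshoot = max(0, used_tokens - budget) / max(budget, 1)
    return 1.0 - alpha * min(overshoot, 1.0)
```

The key design point suggested by the abstract is that the length target is not a hard truncation: it enters only through the prompt and the reward, so the policy can still exceed the budget on genuinely hard problems.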