LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
July 21, 2025
Authors: Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Large reasoning models have achieved remarkable performance through extended
chain-of-thought sequences, yet this computational freedom leads to excessive
token generation even for simple problems. We present Length-Adaptive Policy
Optimization (LAPO), a novel framework that transforms reasoning length control
from an external constraint into an intrinsic model capability. Unlike existing
approaches that impose rigid limits or rely on post-hoc interventions, LAPO
enables models to internalize an understanding of appropriate reasoning depth
through a two-stage reinforcement learning process. In the first stage, models
learn natural reasoning patterns by discovering the statistical distribution of
successful solution lengths. The second stage leverages these patterns as
meta-cognitive guidance, embedding them directly within the model's reasoning
context to ensure inference-time flexibility. Experiments on mathematical
reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9%
while improving accuracy by 2.3%. Our analysis reveals that models trained
with LAPO develop emergent abilities to allocate computational resources based
on problem complexity, achieving efficient reasoning without sacrificing
quality.
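
The abstract describes the two-stage process only at a high level. As a rough illustration of how such a pipeline could be organized, the Python sketch below collects length statistics from successful rollouts (stage 1) and then embeds a per-problem length hint in the prompt while shaping the reward around that target (stage 2). The data layout (`rollouts`), the choice of the median statistic, the prompt wording, and the reward shaping are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a length-adaptive two-stage RL setup, assuming a simple
# rollout record format; this is NOT the authors' released implementation.
from typing import Dict, List


def stage1_length_statistics(rollouts: Dict[str, List[dict]]) -> Dict[str, int]:
    """Stage 1: for each problem, summarize the token lengths of rollouts that
    reached a correct answer (here with the median) as its 'natural' length."""
    targets: Dict[str, int] = {}
    for problem_id, samples in rollouts.items():
        ok_lengths = sorted(s["num_tokens"] for s in samples if s["is_correct"])
        if ok_lengths:
            targets[problem_id] = ok_lengths[len(ok_lengths) // 2]
    return targets


def stage2_prompt_with_budget(question: str, target_tokens: int) -> str:
    """Stage 2: embed the discovered length as guidance directly in the
    reasoning context (the exact phrasing is a placeholder)."""
    return (f"{question}\n\n"
            f"I expect to solve this in roughly {target_tokens} reasoning tokens.")


def length_adaptive_reward(is_correct: bool, num_tokens: int,
                           target_tokens: int, bonus: float = 0.2) -> float:
    """Illustrative RL reward: correctness dominates, with a small bonus for
    staying close to the problem-specific length target."""
    if not is_correct:
        return 0.0
    deviation = abs(num_tokens - target_tokens) / max(target_tokens, 1)
    return 1.0 + max(0.0, bonus * (1.0 - deviation))
```

In this hypothetical setup, easy problems whose successful solutions are short receive short targets, so the policy is rewarded for terminating early on them while retaining longer budgets for harder problems, which mirrors the adaptive behavior the abstract reports.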