LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

Abstract

Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. In the second stage, these patterns serve as meta-cognitive guidance and are embedded directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
