LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning

Abstract

Since the release of DeepSeek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets and show that it consistently improves learning effectiveness. We also conduct a detailed ablation study of alternative ways to incorporate length signals into dynamic sampling, which offers further insights and highlights promising directions for future research.
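To make the idea of length-aware selection concrete, the following is a minimal sketch and not the paper's implementation. The abstract states only that training data is selected at each step based on average response length; it does not give the selection rule. The function names, the `rollouts` layout (prompt mapped to a list of tokenized responses), and the `keep` predicate are all hypothetical and introduced here for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List


def mean_response_lengths(rollouts: Dict[str, List[List[int]]]) -> Dict[str, float]:
    """Map each prompt to the mean token count of its sampled responses.

    Prompts with no sampled responses are skipped.
    """
    return {
        prompt: mean(len(response) for response in responses)
        for prompt, responses in rollouts.items()
        if responses
    }


def select_prompts(
    rollouts: Dict[str, List[List[int]]],
    keep: Callable[[float], bool],
) -> List[str]:
    """Return the prompts whose mean response length satisfies `keep`.

    `keep` is a caller-supplied rule (an assumption of this sketch);
    the source abstract does not specify which lengths are retained.
    """
    lengths = mean_response_lengths(rollouts)
    return [p for p in rollouts if p in lengths and keep(lengths[p])]


# Toy rollouts: token ids are placeholders, only the lengths matter.
rollouts = {
    "p1": [[1, 2], [1, 2, 3, 4]],   # mean length 3
    "p2": [[0] * 10, [0] * 12],     # mean length 11
    "p3": [[0] * 50, [0] * 70],     # mean length 60
}

# Hypothetical rule: keep prompts with very short or very long mean responses.
selected = select_prompts(rollouts, keep=lambda m: m <= 5 or m >= 40)
```

In a training loop, a filter of this kind would run once per step on freshly sampled rollouts, and only the selected prompts would contribute to the policy update. The thresholds in the usage example are arbitrary placeholders.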
