Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

Abstract

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation: policies come to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although prior work has used Pass@k for evaluation, its connection to the exploration ability of large language models (LLMs) in RLVR has been largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, which yields an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives but can mutually enhance each other. Moreover, Pass@k Training with the analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we conduct a preliminary exploration of advantage design for RLVR, showing promising results and highlighting a potential future direction.
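
The abstract does not spell out how Pass@k is computed. As a minimal sketch, assuming the commonly used unbiased estimator (draw n responses per prompt, count the c that the verifier accepts, and compute the probability that a random subset of k contains at least one correct response), the metric could be written as follows. The function name and the parameters n, c, and k are illustrative assumptions, and this is not necessarily the exact reward or advantage formulation derived in the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct.

    Assumed formula: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that a random size-k subset holds no correct response.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 whenever
    # every size-k subset must contain at least one correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 8 samples and c = 2 correct, `pass_at_k(8, 2, 1)` is 0.25 while `pass_at_k(8, 2, 4)` is roughly 0.786. Under this assumed estimator, a reward based on Pass@k with k > 1 gives credit to a group of samples as soon as any one of them succeeds, which is the intuition behind using it to encourage exploration.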
