Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Abstract

Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies make better use of limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, which reduces the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach that targets pass-rate-1 prompts to better utilize limited verifiable prompts. Specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Code, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.
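The idea of composing saturated prompts can be illustrated with a minimal sketch. The abstract does not say how problems are combined or how the curriculum is scheduled, so everything below is an assumption made for illustration and not the authors' implementation: the `Problem` dataclass, the helpers `pass_rate`, `compose`, `verify`, and `curriculum_depth`, the "answer every sub-problem in order" template, and the linear depth schedule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Problem:
    question: str
    answers: Tuple[str, ...]  # ground-truth answers, one per sub-question


def pass_rate(rewards: Sequence[int]) -> float:
    """Fraction of rollouts that earned the binary verifiable reward."""
    return sum(rewards) / len(rewards)


def compose(problems: Sequence[Problem]) -> Problem:
    """Hypothetical composition: ask for every sub-answer, in order."""
    parts = [f"Problem {i}: {p.question}" for i, p in enumerate(problems, 1)]
    question = "\n".join(parts) + "\nGive the answers in order, separated by commas."
    answers = tuple(a for p in problems for a in p.answers)
    return Problem(question, answers)


def verify(problem: Problem, response: str) -> int:
    """Reward 1 only if every sub-answer matches the ground truth exactly."""
    predicted = tuple(part.strip() for part in response.split(","))
    return int(predicted == problem.answers)


def curriculum_depth(step: int, total_steps: int, max_depth: int) -> int:
    """Assumed linear schedule: depth grows from 1 to max_depth over training."""
    return 1 + (max_depth - 1) * step // max(total_steps - 1, 1)


# Toy data: both prompts are solved in every rollout, so their pass rate is 1.
pool = [Problem("What is 2 + 3?", ("5",)), Problem("What is 4 * 6?", ("24",))]
rollout_rewards = [[1, 1, 1, 1], [1, 1, 1, 1]]

saturated = [p for p, r in zip(pool, rollout_rewards) if pass_rate(r) == 1.0]
composed = compose(saturated)
```

In this sketch the composed prompt stays verifiable because its reward is still an exact match against known answers, and a single mistake in any sub-problem yields reward 0, which makes the composed prompt harder than either of its parts.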
