The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

Abstract

Applying Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has yielded significant improvements in the reasoning and problem-solving abilities of Large Language Models. Although RLVR succeeds at single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability: the diversity of generations decreases, and performance under Best-of-N sampling degrades as a result for large N. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. We then extend the derivation to off-policy updates, a common element of modern RLVR algorithms that improves sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
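The abstract names max@k without defining it. As a rough illustration only, the sketch below assumes that max@k denotes the expected maximum reward among k independent generations for a prompt, and estimates it from n >= k sampled rewards by averaging the maximum over all size-k subsets. This is the standard combinatorial estimator, not necessarily the paper's formulation; the function names `max_at_k` and `pass_at_k` are hypothetical. With binary rewards the estimate coincides with the familiar pass@k estimator, which is one sense in which max@k generalizes pass@k.

```python
from math import comb


def max_at_k(rewards, k):
    """Estimate E[max of k i.i.d. rewards] from n >= k sampled rewards.

    Assumption (not taken from the paper): max@k is the expected maximum
    reward over k independent generations. The estimate averages the
    maximum over every size-k subset of the n samples.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The i-th smallest reward (1-indexed) is the maximum of exactly
    # C(i-1, k-1) of the C(n, k) subsets; comb() returns 0 when i < k.
    weighted = sum(comb(i - 1, k - 1) * r
                   for i, r in enumerate(ordered, start=1))
    return weighted / comb(n, k)


def pass_at_k(n, c, k):
    """Standard pass@k estimate from n samples of which c are correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    binary = [0, 0, 1, 1]          # two correct generations out of four
    print(max_at_k(binary, 2), pass_at_k(4, 2, 2))
    graded = [0.2, 0.5, 0.9]       # continuous rewards
    print(max_at_k(graded, 2))
```

Setting k = 1 recovers the mean reward and k = n recovers the sample maximum, so k interpolates between single-generation quality and Best-of-N quality.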
