Model-Task Alignment Drives Distinct RL Outcomes

Abstract

Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. Notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold, and critically the conditions under which they fail, remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
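The abstract names pass@k accuracy as the measure of model-task alignment but does not say how it is computed. As a minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k samples is correct), it could be computed as below. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct.

    Assumed estimator: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that k samples drawn without replacement are all wrong.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as an (n, c) pair."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this reading, a model with high pass@k on a task before any RL training would count as strongly aligned with that task, and a model with low pass@k would fall into the more challenging regime the abstract describes.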
