TTRL: Test-Time Reinforcement Learning

Abstract

This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge is reward estimation during inference without access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 using only unlabeled test data. Furthermore, although TTRL is supervised only by the Maj@N metric, it consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight its potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
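The majority-voting reward mentioned in the abstract can be illustrated with a minimal sketch. It assumes that several final answers are sampled for one unlabeled question, that the most frequent answer is treated as a pseudo-label, and that each sample receives a binary reward for agreeing with it. The function name `majority_vote_rewards` and the binary reward values are illustrative assumptions, not details taken from the abstract or from the TTRL repository.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Hypothetical helper: estimate rewards without ground-truth labels.

    `answers` holds the final answers extracted from N sampled responses
    to a single question. Returns the majority answer (used as a
    pseudo-label) and one reward per sampled answer.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    counts = Counter(answers)
    # The most frequent answer stands in for the unknown true label.
    # Ties resolve to the answer that was encountered first.
    pseudo_label, _ = counts.most_common(1)[0]

    # Assumed reward rule: 1.0 if a sample agrees with the majority, else 0.0.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    sampled = ["42", "42", "17", "42", "8"]
    label, rewards = majority_vote_rewards(sampled)
    print(label, rewards)
```

In an RL loop these per-sample rewards would replace label-based rewards in the policy update. The choice of RL algorithm and the way answers are extracted from responses are outside the scope of this sketch.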
