Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Abstract

Training models to use test-time compute effectively is crucial for improving the reasoning performance of LLMs. Current methods mostly do so by fine-tuning on search traces or by running reinforcement learning (RL) with a 0/1 outcome reward, but do these approaches use test-time compute efficiently? Would they continue to scale as the compute budget increases? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning problem, which provides a principled perspective on spending test-time compute. This perspective lets us view the long output stream from the LLM as consisting of several episodes run at test time, and leads us to use cumulative regret over output tokens as a measure of the efficacy of test-time compute. Akin to how RL algorithms can best trade off exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. We show that state-of-the-art models do not minimize regret, but that one can do so by maximizing a dense reward bonus in conjunction with the 0/1 outcome reward. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
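The abstract gives no formulas, so the following is only a minimal sketch of the two quantities it names, under stated assumptions. It assumes the output stream has already been split into blocks (episodes) and that some external estimator supplies the probability of eventual success before the first block and after each block. The function names, the `best` reference value of 1.0, and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import List


def progress_bonuses(success_probs: List[float]) -> List[float]:
    """Per-block progress: change in the estimated probability of eventual success.

    success_probs[0] is the estimate before any block is generated;
    success_probs[j] is the estimate after block j (assumed to be given).
    """
    return [after - before
            for before, after in zip(success_probs, success_probs[1:])]


def cumulative_regret(success_probs: List[float], best: float = 1.0) -> float:
    """Sum over blocks of the gap between a reference success probability
    and the estimate after each block.

    `best` is an assumed comparator (here, certain success); the paper's
    actual comparator may differ.
    """
    return sum(best - p for p in success_probs[1:])


if __name__ == "__main__":
    # Hypothetical estimates: the second block backtracks, the third recovers.
    probs = [0.2, 0.5, 0.4, 0.9]
    print("progress per block:", progress_bonuses(probs))
    print("cumulative regret:", cumulative_regret(probs))
```

Under these assumptions the bonuses telescope: their sum equals the final success estimate minus the initial one, so a block that lowers the estimate receives a negative bonus, while a stream that reaches a high estimate early accumulates less regret than one that reaches it late.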
