

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

March 10, 2025
Authors: Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar
cs.AI

Abstract

Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with a 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best trade off exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the 0/1 outcome-reward RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
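
As a rough illustration of the dense "progress" bonus described above: the sketch below assumes a hypothetical helper `rollout_success_prob` that estimates the likelihood of eventual success from a partial output (e.g., by sampling completions and checking correctness), and an illustrative weight `alpha` and block segmentation. It is a minimal reading of the abstract, not the paper's exact formulation or released code.

```python
from typing import Callable, List

def progress_bonuses(
    blocks: List[str],
    rollout_success_prob: Callable[[str], float],  # assumed helper, not from the paper
) -> List[float]:
    """Per-block progress: change in the estimated probability of eventual
    success as each block of the output stream is appended to the prefix."""
    bonuses = []
    prefix = ""
    p_prev = rollout_success_prob(prefix)  # success probability before any reasoning
    for block in blocks:
        prefix += block
        p_curr = rollout_success_prob(prefix)
        bonuses.append(p_curr - p_prev)
        p_prev = p_curr
    return bonuses

def mrt_style_reward(
    outcome_correct: bool,
    bonuses: List[float],
    alpha: float = 0.1,  # illustrative weight on the dense bonus
) -> float:
    """0/1 outcome reward plus a weighted sum of per-block progress bonuses."""
    return float(outcome_correct) + alpha * sum(bonuses)
```

Under this reading, a block that raises the model's chance of eventually answering correctly earns a positive bonus, so purely exploratory or redundant blocks are discouraged even when the final answer is correct.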
