Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Abstract

Training models to use test-time compute effectively is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with a 0/1 outcome reward, but do these approaches use test-time compute efficiently? Would they continue to scale as the budget increases? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time, and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best trade off exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the 0/1 outcome reward RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
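
The two quantities named in the abstract, the per-block "progress" bonus and cumulative regret, can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how the likelihood of eventual success is estimated, how regret is normalized, or how the bonus is weighted. The function names, the `best` reference value, and the weight `alpha` below are all hypothetical, and the success probabilities are made-up inputs.

```python
from typing import List


def progress_bonuses(success_probs: List[float]) -> List[float]:
    """Progress of each block: change in estimated success probability.

    success_probs[0] is the estimate before any block is generated and
    success_probs[j] is the estimate after block j (assumed to be given).
    """
    return [after - before
            for before, after in zip(success_probs, success_probs[1:])]


def cumulative_regret(success_probs: List[float], best: float = 1.0) -> float:
    """Sum over blocks of the gap to an assumed best achievable success rate."""
    return sum(best - p for p in success_probs[1:])


def shaped_return(outcome: int, success_probs: List[float],
                  alpha: float = 0.5) -> float:
    """0/1 outcome reward plus a weighted sum of progress bonuses.

    alpha is a hypothetical weight; the abstract does not specify one.
    """
    return outcome + alpha * sum(progress_bonuses(success_probs))


if __name__ == "__main__":
    # Illustrative estimates for a three-block output stream.
    probs = [0.1, 0.3, 0.25, 0.8]
    print(progress_bonuses(probs))        # second block makes negative progress
    print(cumulative_regret(probs))
    print(shaped_return(1, probs))
```

In this toy stream the second block lowers the estimated chance of success, so it receives a negative bonus, while the third block is rewarded for most of the progress. Under these assumptions, a stream that reaches a high success probability in fewer blocks accumulates less regret.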
