Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Abstract

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large, fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, with the aim of improving token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking process for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results on mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
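The abstract gives no formulas, so the following is only a minimal sketch of how rewards verified at several truncation budgets could be combined. It assumes the anytime objective is the prior-weighted sum of per-budget rewards, and that thinking tokens generated after a budget cannot affect the reward verified at that budget. The function names and the example numbers are hypothetical, and this is not the BRPO estimator itself.

```python
from typing import List, Sequence


def anytime_objective(rewards: Sequence[float], prior: Sequence[float]) -> float:
    """Prior-weighted reward over budgets: sum_j prior[j] * rewards[j].

    Assumption: rewards[j] is the verified reward of the summary produced
    after truncating the thinking process at the j-th budget.
    """
    if len(rewards) != len(prior):
        raise ValueError("rewards and prior must have the same length")
    return sum(p * r for p, r in zip(prior, rewards))


def returns_to_go(rewards: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Return G[j] = sum_{k >= j} prior[k] * rewards[k].

    G[j] is the share of the objective that thinking tokens in segment j
    (between budget j-1 and budget j) can still influence, under the
    assumption stated above. Budgets must be listed in increasing order.
    """
    if len(rewards) != len(prior):
        raise ValueError("rewards and prior must have the same length")
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the largest budget backwards.
    for j in range(len(rewards) - 1, -1, -1):
        running += prior[j] * rewards[j]
        out[j] = running
    return out


if __name__ == "__main__":
    # Hypothetical example: four increasing budgets, uniform prior,
    # and an answer that first becomes correct at the third budget.
    prior = [0.25, 0.25, 0.25, 0.25]
    rewards = [0.0, 0.0, 1.0, 1.0]
    print(anytime_objective(rewards, prior))
    print(returns_to_go(rewards, prior))
```

In this sketch, tokens in the final segment are credited only with the reward at the largest budget, while earlier tokens are also credited with every later budget they contribute to. This is one way to read the abstract's claim that dense rewards support more effective credit assignment than a single end-of-trace reward.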

