Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

Abstract

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, has struggled to balance exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Although Pass@k has been used for evaluation in prior work, its connection to the exploration ability of large language models (LLMs) in RLVR has been largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, which yields an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives and can instead mutually enhance each other. Moreover, Pass@k Training with the analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we conduct a preliminary exploration of advantage design for RLVR, showing promising results and highlighting a potential future direction.
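For readers unfamiliar with the metric, the following minimal sketch shows how Pass@k is commonly estimated from n sampled responses of which c are verified correct, using the standard unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and its parameters are illustrative assumptions; the abstract does not state how the reward or the analytical advantage is actually computed in Pass@k Training, so this illustrates the evaluation metric only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct.

    Illustrative helper (not taken from the paper): returns the probability
    that at least one of k responses drawn without replacement from the
    n samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing only incorrect responses;
    # it is 0 when fewer than k incorrect responses exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 2 correct responses out of 8 samples.
    print(pass_at_k(8, 2, 1))  # Pass@1 equals the plain accuracy c / n
    print(pass_at_k(8, 2, 4))  # Pass@4 credits any success among 4 draws
```

With 2 correct responses out of 8, Pass@1 is 0.25 while Pass@4 is roughly 0.79, which illustrates why a Pass@k-style signal is more tolerant of individual failed attempts than Pass@1.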
