
Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

August 14, 2025
Authors: Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, faces challenges in balancing exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. In prior work, although Pass@k has been used in evaluation, its connection to the exploration ability of LLMs in RLVR has been largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training) and observe an improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with the analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we preliminarily explore advantage design for RLVR, obtaining promising results and highlighting a potential future direction.
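
The abstract describes replacing the usual Pass@1 reward with a Pass@k signal. As a rough illustration of the quantities involved, the Python sketch below shows the standard unbiased Pass@k estimator and a naive group-max Pass@k reward over sampled rollouts; the function names (`pass_at_k_estimate`, `group_pass_at_k_rewards`) and the fixed-size grouping scheme are our assumptions for illustration, not the paper's analytical advantage derivation.

```python
# Illustrative sketch only: quantities related to Pass@k as a training signal.
from math import comb
from typing import List


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Standard unbiased estimator of Pass@k given n sampled rollouts, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def group_pass_at_k_rewards(correct: List[bool], k: int) -> List[float]:
    """Naive Pass@k-style reward: split rollouts into groups of size k and give
    every rollout in a group the group's max (0/1) correctness. The paper instead
    derives the corresponding advantage analytically, avoiding this sampling step."""
    rewards = []
    for i in range(0, len(correct), k):
        group = correct[i:i + k]
        rewards.extend([float(any(group))] * len(group))
    return rewards


# Example: 8 rollouts for one prompt, 1 verified correct.
flags = [False, True, False, False, False, False, False, False]
print(pass_at_k_estimate(n=8, c=1, k=4))    # 0.5: expected Pass@4 under random grouping
print(group_pass_at_k_rewards(flags, k=4))  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```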