SimKO: Simple Pass@K Policy Optimization
October 16, 2025
Authors: Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the
reasoning capabilities of large language models (LLMs). However, prevailing
RLVR methods exhibit a systematic bias toward exploitation over exploration, as
evidenced by improved pass@1 but reduced pass@K (K>1) performance. To
understand this issue, we analyze training dynamics of RLVR methods by tracking
the token-level probability distributions over vocabulary candidates. Our
analysis reveals a consistent probability concentration effect where the top-1
candidate increasingly accumulates probability mass and suppresses that of
other candidates. More importantly, stronger over-concentration correlates with
worse pass@K performance. Inspired by this finding, we propose Simple Pass@K
Optimization (SimKO), a method designed to mitigate the over-concentration
issue, thereby encouraging exploration. SimKO operates in an asymmetrical
manner. For verified-correct responses, it boosts the probabilities of the
top-K candidates. For verified-incorrect responses, it applies stronger
penalties to the top-1 candidate. We observe that this asymmetric design is
particularly effective at mitigating over-concentration when applied at tokens
with high entropy. Across various math and logical-reasoning benchmarks, SimKO
consistently yields higher pass@K for a wide range of K, providing a simple way
to improve RLVR's exploration.
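
The abstract does not spell out the exact objective, so the following is only a minimal sketch of the asymmetric, entropy-gated idea it describes, under stated assumptions: we have per-token logits over the vocabulary and a verifier flag for the whole response; the function name `simko_style_update_term` and the parameters `k`, `entropy_threshold`, and `penalty_scale` are hypothetical, not the authors' API.

```python
# Sketch only: an asymmetric token-level regularizer in the spirit of SimKO,
# not the paper's actual loss. Correct responses boost the top-K candidates;
# incorrect responses penalize the top-1 candidate more strongly; the term is
# applied only at high-entropy tokens.
import torch
import torch.nn.functional as F


def simko_style_update_term(logits, is_correct,
                            k=3, entropy_threshold=1.0, penalty_scale=2.0):
    """Return a scalar term to be added to a base RLVR policy-gradient loss.

    logits:      (T, V) token-level logits over the vocabulary
    is_correct:  bool, verifier outcome for the whole response
    """
    log_probs = F.log_softmax(logits, dim=-1)          # (T, V)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)         # (T,) token-level entropy
    high_entropy = entropy > entropy_threshold         # gate: only high-entropy tokens

    if is_correct:
        # Verified-correct: raise the probability of the top-K candidates
        # by minimizing their negative log-probability.
        topk_logp = log_probs.topk(k, dim=-1).values   # (T, k)
        per_token = -topk_logp.sum(dim=-1)
    else:
        # Verified-incorrect: apply a stronger penalty to the top-1 candidate
        # by minimizing (i.e., pushing down) its log-probability.
        top1_logp = log_probs.max(dim=-1).values       # (T,)
        per_token = penalty_scale * top1_logp

    mask = high_entropy.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

A usage sketch: after sampling a response and scoring it with the verifier, this term would be added (with some weight) to the standard RLVR loss, so that gradient descent spreads probability mass across several plausible candidates on correct traces and flattens the dominant candidate on incorrect ones, counteracting the probability-concentration effect described above.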