SimKO: Simple Pass@K Policy Optimization
October 16, 2025
Authors: Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the
reasoning capabilities of large language models (LLMs). However, prevailing
RLVR methods exhibit a systematic bias toward exploitation over exploration, as
evidenced by improved pass@1 but reduced pass@K (K>1) performance. To
understand this issue, we analyze training dynamics of RLVR methods by tracking
the token-level probability distributions over vocabulary candidates. Our
analysis reveals a consistent probability concentration effect where the top-1
candidate increasingly accumulates probability mass and suppresses that of
other candidates. More importantly, stronger over-concentration correlates with
worse pass@K performance. Inspired by this finding, we propose Simple Pass@K
Optimization (SimKO), a method designed to mitigate the over-concentration
issue, thereby encouraging exploration. SimKO operates in an asymmetrical
manner. For verified-correct responses, it boosts the probabilities of the
top-K candidates. For verified-incorrect responses, it applies stronger
penalties to the top-1 candidate. We observe that this asymmetric design is
particularly effective at mitigating over-concentration when applied at tokens
with high entropy. Across various math and logical-reasoning benchmarks, SimKO
consistently yields higher pass@K for a wide range of K, providing a simple way
to improve RLVR's exploration.
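
The following is a minimal sketch, not the paper's released implementation, of the asymmetric token-level update described above: on verified-correct responses the top-K vocabulary candidates are boosted rather than only the sampled token, and on verified-incorrect responses the top-1 candidate is penalized more strongly, with the extra terms applied only at high-entropy tokens. The function name, hyperparameters (k, entropy_threshold, boost_weight, penalty_weight), and loss weighting are illustrative assumptions.

```python
# Illustrative sketch of a SimKO-style asymmetric update (assumed details,
# not the authors' code). Inputs: per-token logits from the policy and the
# sampled token ids for one rollout, plus its verified correctness.
import torch
import torch.nn.functional as F

def simko_style_loss(logits, target_ids, is_correct, k=4,
                     entropy_threshold=1.0, boost_weight=0.1, penalty_weight=0.2):
    log_probs = F.log_softmax(logits, dim=-1)           # [T, V]
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)               # [T] token-level entropy
    high_entropy = entropy > entropy_threshold            # gate the correction terms

    # Standard policy-gradient-style term on the sampled tokens:
    # reinforce them if the response is verified correct, suppress otherwise.
    tok_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    base = -tok_logp if is_correct else tok_logp

    topk_logp, _ = log_probs.topk(k, dim=-1)              # [T, k], sorted descending
    if is_correct:
        # Spread the boost over the top-k candidates to counteract
        # probability concentration on the top-1 candidate.
        extra = -boost_weight * topk_logp.mean(-1)
    else:
        # Penalize the top-1 candidate more strongly than the rest.
        extra = penalty_weight * topk_logp[:, 0]

    # Apply the asymmetric correction only at high-entropy tokens.
    return (base + torch.where(high_entropy, extra, torch.zeros_like(extra))).mean()
```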