
SimKO: Simple Pass@K Policy Optimization

October 16, 2025
Authors: Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
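The asymmetric update described above can be sketched as a per-token rule: at high-entropy tokens, verified-correct responses spread positive updates over the top-K vocabulary candidates, while verified-incorrect responses concentrate a stronger penalty on the top-1 candidate. The sketch below is illustrative only; the function name and the knobs `k`, `entropy_threshold`, and `neg_scale` are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def simko_token_update(logits, correct, k=3, entropy_threshold=1.0, neg_scale=2.0):
    """Illustrative SimKO-style asymmetric update direction for one token.

    Returns a gradient-like vector over the vocabulary:
    - correct response: boost the top-k candidates, spreading probability
      mass instead of concentrating it on the top-1 token;
    - incorrect response: apply a stronger penalty to the top-1 candidate.
    The update is applied only at high-entropy tokens.
    (k, entropy_threshold, and neg_scale are hypothetical knobs.)
    """
    p = softmax(logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    grad = np.zeros_like(p)
    if entropy < entropy_threshold:
        return grad  # low-entropy token: leave the distribution untouched
    if correct:
        topk = np.argsort(p)[-k:]   # indices of the k most probable candidates
        grad[topk] += 1.0 / k       # boost top-k uniformly
    else:
        top1 = int(np.argmax(p))
        grad[top1] -= neg_scale     # stronger penalty on the top-1 candidate
    return grad
```

For example, with a moderately flat distribution a correct token yields positive updates on the top-K candidates only, while an incorrect token yields a single large negative update on the argmax token; a sharply peaked (low-entropy) distribution receives no update at all, mirroring the paper's observation that the asymmetric design matters most at high-entropy tokens.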
PDF · December 21, 2025