
SimKO: Simple Pass@K Policy Optimization

October 16, 2025
Authors: Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
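The asymmetric update described above can be sketched as a per-token rule: at high-entropy tokens, verified-correct responses spread positive updates over the top-K vocabulary candidates, while verified-incorrect responses concentrate a stronger penalty on the top-1 candidate. The sketch below is illustrative only; the function name and the knobs `k`, `entropy_threshold`, and `neg_scale` are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def simko_token_update(logits, correct, k=3, entropy_threshold=1.0, neg_scale=2.0):
    """Illustrative SimKO-style asymmetric update direction for one token.

    Returns a gradient-like vector over the vocabulary:
    - correct response: boost the top-k candidates, spreading probability
      mass instead of concentrating it on the top-1 token;
    - incorrect response: apply a stronger penalty to the top-1 candidate.
    The update is applied only at high-entropy tokens.
    (k, entropy_threshold, and neg_scale are hypothetical knobs.)
    """
    p = softmax(logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    grad = np.zeros_like(p)
    if entropy < entropy_threshold:
        return grad  # low-entropy token: leave the distribution untouched
    if correct:
        topk = np.argsort(p)[-k:]   # indices of the k most probable candidates
        grad[topk] += 1.0 / k       # boost top-k uniformly
    else:
        top1 = int(np.argmax(p))
        grad[top1] -= neg_scale     # stronger penalty on the top-1 candidate
    return grad
```

For example, with a moderately flat distribution a correct token yields positive updates on the top-K candidates only, while an incorrect token yields a single large negative update on the argmax token; a sharply peaked (low-entropy) distribution receives no update at all, mirroring the paper's observation that the asymmetric design matters most at high-entropy tokens.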
PDF · December 21, 2025