
Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

March 23, 2026
Authors: Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR's performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
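The token-level characterization described above hinges on comparing the base and RL policies' next-token distributions position by position. A minimal sketch of that measurement, assuming per-position KL divergence KL(p_RL || p_base) as the shift metric and a fixed threshold for calling a position "shifted" (the paper does not specify these exact choices):

```python
import math

def softmax(logits):
    """Convert a logit vector to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_level_kl(base_logits, rl_logits):
    """Per-position KL(p_RL || p_base) along a shared token sequence.

    base_logits, rl_logits: lists (length T) of next-token logit
    vectors (length V) from the base and RL models on the same prefix.
    """
    kls = []
    for b, r in zip(base_logits, rl_logits):
        pb, pr = softmax(b), softmax(r)
        kls.append(sum(p * math.log(p / q) for p, q in zip(pr, pb) if p > 0))
    return kls

def shifted_positions(kls, threshold=0.1):
    """Indices where the RL policy meaningfully diverges from the base.

    The threshold is illustrative; the paper reports that only a small
    fraction of positions exceed any meaningful divergence level.
    """
    return [i for i, kl in enumerate(kls) if kl > threshold]
```

Under the paper's sparsity finding, `shifted_positions` would return only a handful of indices for most sequences, with the bulk of positions showing near-zero divergence.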
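The cross-sampling intervention can be illustrated in simplified form. In the actual experiments each swapped token alters the subsequent generation; the static sketch below ignores that dependence and simply swaps the RL model's token choices into a base-model sequence at the `budget` most divergent positions (`budget`, the divergence scores, and the ranking rule are assumptions for illustration):

```python
def cross_sample(base_tokens, rl_tokens, divergences, budget):
    """Hypothetical cross-sampling with an intervention budget.

    Starts from the base model's token choices and swaps in the RL
    model's choice at the `budget` positions with the largest
    base-vs-RL divergence. Static simplification: real interventions
    would re-generate the suffix after each swap.
    """
    ranked = sorted(range(len(divergences)),
                    key=lambda i: divergences[i], reverse=True)
    swap = set(ranked[:budget])
    return [rl_tokens[i] if i in swap else base_tokens[i]
            for i in range(len(base_tokens))]
```

Sweeping `budget` from 0 to T mirrors the paper's finding: a small budget of RL-sampled tokens recovers most of the RL performance gain, while the symmetric swap (base tokens into RL generations) collapses performance.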