Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects, organized around three main analyses: (1) a token-level characterization of the distributional shifts between base and RL models, (2) the impact of those token-level shifts on sequence-level reasoning performance, measured through cross-sampling interventions, and (3) the fine-grained mechanics of the shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes: only a small fraction of token distributions exhibit meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models under varying intervention budgets. Inserting only a small fraction of RL-sampled tokens into base generations progressively recovers the RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels. These experiments isolate a small set of token-level decisions directly responsible for RLVR's performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention and find that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
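
As an illustration of the cross-sampling idea described above, the following is a minimal sketch rather than the procedure used in the study. The abstract does not specify the divergence measure, the selection rule, or the decoding strategy, so the use of KL divergence, the fixed `threshold`, greedy decoding, and the names `kl`, `cross_sample`, `toy_base`, and `toy_rl` are all assumptions made for illustration. The two toy policies stand in for real models and differ at a single position.

```python
import math
from typing import Callable, List, Sequence, Tuple

Dist = Sequence[float]                # probabilities over a shared toy vocabulary
Policy = Callable[[List[int]], Dist]  # hypothetical interface: prefix -> next-token distribution


def kl(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for a single position (assumed divergence measure)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def cross_sample(base: Policy, rl: Policy, length: int,
                 budget: int, threshold: float) -> Tuple[List[int], List[int]]:
    """Generate with the base policy, but take the RL policy's choice at up to
    `budget` positions where the two next-token distributions diverge."""
    prefix: List[int] = []
    swapped: List[int] = []
    for t in range(length):
        p_base, p_rl = base(prefix), rl(prefix)
        use_rl = len(swapped) < budget and kl(p_rl, p_base) > threshold
        dist = p_rl if use_rl else p_base
        # Greedy decoding keeps the sketch deterministic; real runs would sample.
        token = max(range(len(dist)), key=dist.__getitem__)
        if use_rl:
            swapped.append(t)
        prefix.append(token)
    return prefix, swapped


def toy_base(prefix: List[int]) -> Dist:
    """Toy base policy: the same distribution at every position."""
    return [0.5, 0.3, 0.2]


def toy_rl(prefix: List[int]) -> Dist:
    """Toy RL policy: identical to the base except at position 2."""
    return [0.05, 0.05, 0.9] if len(prefix) == 2 else [0.5, 0.3, 0.2]


if __name__ == "__main__":
    for budget in (0, 1):
        tokens, swapped = cross_sample(toy_base, toy_rl, length=5,
                                       budget=budget, threshold=0.1)
        print(f"budget={budget} tokens={tokens} swapped_positions={swapped}")
```

Because each token is chosen given the shared prefix generated so far, a swap at one position can change every later distribution in a real model. That is why a small intervention budget can have sequence-level effects. The reverse intervention (base tokens injected into RL generations) follows by exchanging the roles of the two policies.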
