Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR's performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
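To make the notion of a sparse token-level shift concrete, the following is a minimal sketch of how one might flag positions where the base and RL next-token distributions diverge. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract does not name the divergence measure, so Jensen-Shannon divergence is assumed here, and the function names, the toy distributions, and the 0.1 threshold are all hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    Assumption: JS divergence is chosen for illustration only; the source
    does not specify which divergence it uses.
    """
    def kl(a, b):
        # Terms with a[i] == 0 contribute nothing to the KL sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def shifted_positions(base_dists, rl_dists, threshold=0.1):
    """Return indices of positions whose divergence exceeds the threshold.

    The threshold value is a hypothetical choice, not taken from the source.
    """
    return [
        i
        for i, (p, q) in enumerate(zip(base_dists, rl_dists))
        if js_divergence(p, q) > threshold
    ]


# Toy next-token distributions over a 3-token vocabulary at 3 positions.
base = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
rl = [[0.7, 0.2, 0.1], [0.05, 0.9, 0.05], [0.6, 0.3, 0.1]]

positions = shifted_positions(base, rl)
fraction = len(positions) / len(base)
print(positions, fraction)
```

In this toy case only the middle position differs between the two policies, so a small fraction of positions carries the entire shift. In a real analysis the distributions would come from the two models' softmax outputs evaluated on the same prefix.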

