On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. Existing analyses show that RLVR-induced changes are sparse, but they focus primarily on the magnitude of these updates and largely overlook their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, and that this direction can be captured by the signed, token-level log probability difference Δlog p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p identifies sparse yet reasoning-critical updates more effectively than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability tokens (which correspond to higher Δlog p), improving reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
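To make the two central quantities concrete, the following is a minimal sketch of the signed token-level difference Δlog p and of one possible test-time extrapolation rule. The abstract does not state the exact extrapolation formula, so the rule used here (shift the RLVR log-probabilities by alpha times Δlog p, then renormalize) is an assumption, as are the function names, the `alpha` parameter, and the toy logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def delta_log_p(base_logp, rlvr_logp):
    """Signed per-token difference; positive where RLVR raised the probability."""
    return [r - b for b, r in zip(base_logp, rlvr_logp)]

def extrapolate(base_logp, rlvr_logp, alpha):
    """Assumed rule (not taken from the paper): move the RLVR log-probs
    a further alpha * delta along the learned direction, then renormalize."""
    delta = delta_log_p(base_logp, rlvr_logp)
    return log_softmax([r + alpha * d for r, d in zip(rlvr_logp, delta)])

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
base_logp = log_softmax([2.0, 1.0, 0.0])   # hypothetical base model
rlvr_logp = log_softmax([2.0, 2.0, 0.0])   # hypothetical RLVR-trained model

delta = delta_log_p(base_logp, rlvr_logp)
extra_logp = extrapolate(base_logp, rlvr_logp, alpha=1.0)

for tok, (d, r, e) in enumerate(zip(delta, rlvr_logp, extra_logp)):
    print(f"token {tok}: delta={d:+.3f}  p_rlvr={math.exp(r):.3f}  p_extra={math.exp(e):.3f}")
```

In this toy case the token whose probability RLVR increased (positive Δlog p) gains further mass under extrapolation, while `alpha=0.0` leaves the RLVR distribution unchanged. A magnitude-only metric would discard the sign that separates promoted tokens from suppressed ones.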
