Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

Abstract

A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective and propose that the perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its first- and second-order derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis shows that, at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4), which opens an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is to use the theoretically stable ERA as a predictive meta-controller that creates a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including an absolute accuracy improvement of up to 21.4% on the challenging Gaokao 2024 dataset.
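
To make the three quantities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) applied to a hypothetical tokens-by-hidden-dimension matrix, and it assumes that ERV and ERA can be approximated by plain first- and second-order finite differences of an ER series. The abstract does not give the exact formulas, so the function names `effective_rank` and `rank_dynamics` and the differencing scheme are illustrative assumptions.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (tokens x hidden_dim) matrix.

    ASSUMPTION: the matrix is used as given (no centering or scaling);
    the paper may preprocess hidden states differently.
    """
    singular_values = np.linalg.svd(hidden_states, compute_uv=False)
    # Normalize singular values into a probability distribution.
    p = singular_values / (singular_values.sum() + eps)
    # Drop (near-)zero entries so the logarithm stays finite.
    p = p[p > eps]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def rank_dynamics(er_series):
    """First and second finite differences of an ER series.

    ASSUMPTION: these stand in for ERV (velocity) and ERA (acceleration);
    the paper's exact definitions are not stated in the abstract.
    """
    er = np.asarray(er_series, dtype=float)
    erv = np.diff(er, n=1)  # change in ER between consecutive steps
    era = np.diff(er, n=2)  # change in that change
    return erv, era


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical hidden states: 32 tokens, 16 hidden dimensions.
    states = rng.standard_normal((32, 16))
    # ER measured on growing prefixes of the sequence.
    er_series = [effective_rank(states[:t]) for t in range(2, 33)]
    erv, era = rank_dynamics(er_series)
    print(len(er_series), len(erv), len(era))
```

Under this definition a 4x4 identity matrix has an effective rank of about 4, while a 4x4 all-ones matrix (rank one) has an effective rank of about 1, which matches the intuition that ER measures how many directions of the hidden-state space are actually in use.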

