Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs), mainly by shaping higher-order behaviors such as reflection and planning. However, earlier RLVR algorithms typically apply the same training signal to every token, ignoring the different roles of low-entropy, knowledge-related tokens and high-entropy, reasoning-related tokens. Some recent methods try to separate these token types through gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while applying stronger constraints to knowledge tokens to preserve factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.
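The dual-token constraint described in the abstract can be illustrated with a small sketch. It is not the Archer implementation: it assumes a PPO-style clipped surrogate, a per-response entropy quantile to separate reasoning tokens from knowledge tokens, and a per-token KL estimate supplied by the caller. The function names and every hyperparameter value below (`entropy_quantile`, the clip ranges, the KL weights) are hypothetical placeholders, not values taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dual_token_loss(ratios, advantages, kls, entropies,
                    entropy_quantile=0.5,
                    clip_reasoning=0.4, clip_knowledge=0.2,
                    kl_reasoning=0.001, kl_knowledge=0.01):
    """Mean per-token loss with entropy-dependent clip range and KL weight.

    All default values are illustrative assumptions, not the paper's settings.
    Every token is updated in the same step (synchronous update); only the
    strength of the constraint differs between the two token types.
    """
    # Tokens at or above this entropy are treated as "reasoning" tokens.
    ranked = sorted(entropies)
    index = min(len(ranked) - 1, int(entropy_quantile * len(ranked)))
    threshold = ranked[index]

    losses, is_reasoning = [], []
    for ratio, adv, kl, ent in zip(ratios, advantages, kls, entropies):
        reasoning = ent >= threshold
        # Looser constraints on reasoning tokens, tighter on knowledge tokens.
        eps = clip_reasoning if reasoning else clip_knowledge
        beta = kl_reasoning if reasoning else kl_knowledge
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        losses.append(-surrogate + beta * kl)
        is_reasoning.append(reasoning)
    return sum(losses) / len(losses), is_reasoning


if __name__ == "__main__":
    loss, mask = dual_token_loss(
        ratios=[1.3, 1.3, 1.3, 1.3],
        advantages=[1.0, 1.0, 1.0, 1.0],
        kls=[0.5, 0.5, 0.5, 0.5],
        entropies=[0.1, 2.0, 0.05, 1.5],
    )
    print(mask, loss)
```

In this toy call the same probability ratio of 1.3 is clipped to 1.2 for the two low-entropy tokens but passes through unclipped for the two high-entropy tokens, which is the asymmetry the abstract describes: reasoning tokens are allowed larger policy updates, while knowledge tokens stay closer to the reference policy.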
