Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
July 21, 2025
Authors: Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective
post-training method for improving the reasoning abilities of Large Language
Models (LLMs), mainly by shaping higher-order behaviors such as reflection and
planning. However, previous RLVR algorithms often apply uniform training
signals to all tokens, without considering the different roles of low-entropy
knowledge-related tokens and high-entropy reasoning-related tokens. Some recent
methods try to separate these token types by gradient masking or asynchronous
updates, but these approaches may break semantic dependencies in the model
output and hinder effective learning. In this work, we propose Archer, an
entropy-aware RLVR approach with dual-token constraints and synchronous
updates. Specifically, our method applies weaker KL regularization and higher
clipping thresholds to reasoning tokens to encourage exploration, while using
stronger constraints on knowledge tokens to maintain factual knowledge.
Experimental results on several mathematical reasoning and code generation
benchmarks show that our approach significantly outperforms previous RLVR
methods, reaching or exceeding state-of-the-art performance among models of
comparable size. The code is available at
https://github.com/wizard-III/ArcherCodeR.
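
The dual-token idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed entropy threshold, the clip ranges, the KL coefficients, and the function names `token_entropy` and `dual_token_loss` are all assumptions chosen for illustration. It shows only the general shape of a PPO-style clipped objective in which high-entropy (reasoning) tokens get a wider clip range and a weaker KL penalty, low-entropy (knowledge) tokens get tighter constraints, and every token is updated in the same step rather than being masked out.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dual_token_loss(ratios, advantages, kl_values, entropies,
                    entropy_threshold=1.0,   # assumed cutoff, not from the paper
                    clip_reasoning=0.4,      # assumed: looser clip for reasoning tokens
                    clip_knowledge=0.1,      # assumed: tighter clip for knowledge tokens
                    kl_reasoning=0.0,        # assumed: weaker KL penalty
                    kl_knowledge=0.01):      # assumed: stronger KL penalty
    """Mean per-token loss with entropy-dependent clipping and KL weights.

    ratios:     policy probability ratios pi_new / pi_old, one per token
    advantages: advantage estimates, one per token
    kl_values:  per-token KL estimates against a reference policy
    entropies:  per-token entropies of the sampling policy
    """
    losses = []
    for r, a, kl, h in zip(ratios, advantages, kl_values, entropies):
        # High-entropy tokens are treated as reasoning tokens.
        is_reasoning = h > entropy_threshold
        eps = clip_reasoning if is_reasoning else clip_knowledge
        beta = kl_reasoning if is_reasoning else kl_knowledge

        # Standard clipped surrogate, with a token-specific clip range.
        clipped_ratio = min(max(r, 1.0 - eps), 1.0 + eps)
        surrogate = min(r * a, clipped_ratio * a)

        # Every token contributes to the same update; none is masked out.
        losses.append(-surrogate + beta * kl)
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two tokens with identical ratio and advantage but different entropy.
    loss = dual_token_loss(
        ratios=[1.3, 1.3],
        advantages=[1.0, 1.0],
        kl_values=[0.5, 0.5],
        entropies=[2.0, 0.1],
    )
    print(loss)
```

In the example call, both tokens have the same ratio and advantage, but only the low-entropy token has its ratio clipped and its KL term penalized, so the two tokens contribute differently to the same synchronous update.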