Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
July 21, 2025
Authors: Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective
post-training method for improving the reasoning abilities of Large Language
Models (LLMs), mainly by shaping higher-order behaviors such as reflection and
planning. However, previous RLVR algorithms often apply uniform training
signals to all tokens, without considering the different roles of low-entropy
knowledge-related tokens and high-entropy reasoning-related tokens. Some recent
methods try to separate these token types by gradient masking or asynchronous
updates, but these approaches may break semantic dependencies in the model
output and hinder effective learning. In this work, we propose Archer, an
entropy-aware RLVR approach with dual-token constraints and synchronous
updates. Specifically, our method applies weaker KL regularization and higher
clipping thresholds to reasoning tokens to encourage exploration, while using
stronger constraints on knowledge tokens to maintain factual knowledge.
Experimental results on several mathematical reasoning and code generation
benchmarks show that our approach significantly outperforms previous RLVR
methods, reaching or exceeding state-of-the-art performance among models of
comparable size. The code is available at
https://github.com/wizard-III/ArcherCodeR.
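
Below is a minimal sketch of how the dual-token constraint described in the abstract could be realized in a PPO-style per-token loss. It is not the authors' released implementation: the entropy quantile, the clipping ranges (eps_reason, eps_know), the KL coefficients (kl_reason, kl_know), and all function and argument names are illustrative assumptions. The key point it illustrates is that high-entropy (reasoning) tokens get a looser clip and weaker KL penalty, low-entropy (knowledge) tokens get a tighter clip and stronger KL penalty, and both are optimized in one synchronous update rather than via masking or alternating passes.

```python
# Hypothetical sketch of an entropy-aware, dual-token-constraint PPO-style loss.
# All hyperparameters and names below are assumptions for illustration only.
import torch

def dual_token_policy_loss(
    logprobs,          # (B, T) log-probs of sampled tokens under the current policy
    old_logprobs,      # (B, T) log-probs under the behavior (rollout) policy
    ref_logprobs,      # (B, T) log-probs under the frozen reference policy
    token_entropy,     # (B, T) per-token entropy of the current policy
    advantages,        # (B, T) per-token (or broadcast sequence-level) advantages
    mask,              # (B, T) 1 for response tokens, 0 for prompt/padding
    entropy_quantile=0.8,          # tokens above this quantile count as "reasoning"
    eps_reason=0.3, eps_know=0.2,  # clipping ranges: looser for reasoning tokens
    kl_reason=1e-4, kl_know=1e-3,  # KL coefficients: weaker for reasoning tokens
):
    # Split tokens by entropy: high-entropy -> reasoning, low-entropy -> knowledge.
    threshold = torch.quantile(token_entropy[mask.bool()], entropy_quantile)
    is_reason = (token_entropy >= threshold).float()

    # Token-dependent clipping range and KL weight, applied in one synchronous update.
    eps = eps_reason * is_reason + eps_know * (1.0 - is_reason)
    kl_coef = kl_reason * is_reason + kl_know * (1.0 - is_reason)

    # Standard clipped surrogate objective, with the clip width varying per token.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    pg_loss = -torch.min(unclipped, clipped)

    # Token-dependent KL penalty toward the reference policy (simple k1 estimator).
    kl = logprobs - ref_logprobs
    loss = (pg_loss + kl_coef * kl) * mask
    return loss.sum() / mask.sum().clamp(min=1)
```

In this sketch the entropy-based split is recomputed per batch, so the two token groups share a single backward pass; that is what distinguishes a dual-constraint synchronous update from gradient-masking or asynchronous schemes that the abstract contrasts against.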