Tool Verification for Test-Time Reinforcement Learning
March 2, 2026
Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
cs.AI
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards from majority voting. However, a spurious yet high-frequency unverified consensus can become a biased, self-reinforcing reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., code execution results) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
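The verification-aware voting described above can be illustrated with a minimal sketch. This is an assumed implementation, not the paper's actual method: the function name, the fixed `verified_weight` multiplier, and the input format are all hypothetical, chosen only to show how upweighting tool-verified rollouts lets a verified minority override a spurious unverified consensus.

```python
from collections import defaultdict

def verification_aware_vote(answers, verified, verified_weight=3.0):
    """Weighted majority vote over rollout answers (illustrative sketch).

    answers         -- final answer extracted from each rollout
    verified        -- bools: True if an external tool (e.g., code
                       execution) confirmed the rollout's answer
    verified_weight -- vote weight for verified rollouts (assumed value;
                       the paper may use a different weighting scheme)
    """
    scores = defaultdict(float)
    for ans, ok in zip(answers, verified):
        scores[ans] += verified_weight if ok else 1.0
    # The weighted winner serves as the pseudo-label used as the
    # self-induced reward target during test-time RL.
    return max(scores, key=scores.get)

# Five unverified rollouts agree on a wrong answer, while three
# tool-verified rollouts agree on another: plain majority voting
# would pick "42", but verification-aware voting picks "17"
# (weights: 42 -> 5.0, 17 -> 3 * 3.0 = 9.0).
answers  = ["42", "42", "42", "42", "42", "17", "17", "17"]
verified = [False, False, False, False, False, True, True, True]
pseudo_label = verification_aware_vote(answers, verified)  # "17"
```

Under plain majority voting the high-frequency unverified answer would be reinforced; the verification weight is what breaks that failure mode.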