Tool Verification for Test-Time Reinforcement Learning
March 2, 2026
Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
cs.AI
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards derived from majority voting. However, a spurious yet high-frequency unverified consensus can become a biased, reinforced reward signal, driving mode collapse onto incorrect answers. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., code execution results) to upweight verified rollouts in a verification-aware voting scheme, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbones, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
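The abstract does not spell out the voting rule, so the following is a minimal sketch of what verification-aware voting could look like, under the assumption that tool-verified rollouts receive a fixed upweight factor; the function names, the `alpha` parameter, and the binary reward scheme are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def verification_aware_vote(rollouts, alpha=3.0):
    """Select a pseudo-label by weighted majority voting.

    rollouts: list of (answer, verified) pairs, where `verified` is True
    when an external tool (e.g., code execution) confirmed the rollout.
    alpha: assumed upweight factor for verified rollouts (not from the paper).
    """
    weights = defaultdict(float)
    for answer, verified in rollouts:
        # Tool-verified rollouts count more than unverified ones.
        weights[answer] += alpha if verified else 1.0
    # The pseudo-label is the answer with the largest total weight.
    return max(weights, key=weights.get)

def pseudo_rewards(rollouts, alpha=3.0):
    """Binary self-induced rewards measured against the voted pseudo-label."""
    label = verification_aware_vote(rollouts, alpha)
    return [1.0 if answer == label else 0.0 for answer, _ in rollouts]

# Example: plain majority voting would pick "42" (3 vs. 2 votes), but tool
# verification shifts the pseudo-label to the verified answer "41".
rollouts = [("42", False), ("42", False), ("42", False),
            ("41", True), ("41", True)]
print(verification_aware_vote(rollouts))  # -> "41"
print(pseudo_rewards(rollouts))           # -> [0.0, 0.0, 0.0, 1.0, 1.0]
```

The example illustrates the failure mode the paper targets: an unverified majority ("42") would dominate plain majority voting, whereas upweighting verified rollouts lets the tool-confirmed minority answer set the training signal.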