Tool Verification for Test-Time Reinforcement Learning
March 2, 2026
Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
cs.AI
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards from majority voting. However, a spurious yet high-frequency unverified consensus can become a biased, self-reinforcing reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., code execution results) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
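The verification-aware voting described above can be illustrated with a minimal sketch. This is an assumed implementation, not the paper's actual method: the function name, the fixed `verified_weight` multiplier, and the input format are all hypothetical, chosen only to show how upweighting tool-verified rollouts lets a verified minority override a spurious unverified consensus.

```python
from collections import defaultdict

def verification_aware_vote(answers, verified, verified_weight=3.0):
    """Weighted majority vote over rollout answers (illustrative sketch).

    answers         -- final answer extracted from each rollout
    verified        -- bools: True if an external tool (e.g., code
                       execution) confirmed the rollout's answer
    verified_weight -- vote weight for verified rollouts (assumed value;
                       the paper may use a different weighting scheme)
    """
    scores = defaultdict(float)
    for ans, ok in zip(answers, verified):
        scores[ans] += verified_weight if ok else 1.0
    # The weighted winner serves as the pseudo-label used as the
    # self-induced reward target during test-time RL.
    return max(scores, key=scores.get)

# Five unverified rollouts agree on a wrong answer, while three
# tool-verified rollouts agree on another: plain majority voting
# would pick "42", but verification-aware voting picks "17"
# (weights: 42 -> 5.0, 17 -> 3 * 3.0 = 9.0).
answers  = ["42", "42", "42", "42", "42", "17", "17", "17"]
verified = [False, False, False, False, False, True, True, True]
pseudo_label = verification_aware_vote(answers, verified)  # "17"
```

Under plain majority voting the high-frequency unverified answer would be reinforced; the verification weight is what breaks that failure mode.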