Tool Verification for Test-Time Reinforcement Learning
March 2, 2026
Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
cs.AI
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards derived from majority voting. However, a spurious yet high-frequency unverified consensus can become a biased, reinforced reward signal, driving mode collapse onto incorrect answers. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., code execution results) to upweight verified rollouts in a verification-aware voting scheme, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbones, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
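The abstract does not spell out the voting rule, so the following is a minimal sketch of what verification-aware voting could look like, under the assumption that tool-verified rollouts receive a fixed upweight factor; the function names, the `alpha` parameter, and the binary reward scheme are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def verification_aware_vote(rollouts, alpha=3.0):
    """Select a pseudo-label by weighted majority voting.

    rollouts: list of (answer, verified) pairs, where `verified` is True
    when an external tool (e.g., code execution) confirmed the rollout.
    alpha: assumed upweight factor for verified rollouts (not from the paper).
    """
    weights = defaultdict(float)
    for answer, verified in rollouts:
        # Tool-verified rollouts count more than unverified ones.
        weights[answer] += alpha if verified else 1.0
    # The pseudo-label is the answer with the largest total weight.
    return max(weights, key=weights.get)

def pseudo_rewards(rollouts, alpha=3.0):
    """Binary self-induced rewards measured against the voted pseudo-label."""
    label = verification_aware_vote(rollouts, alpha)
    return [1.0 if answer == label else 0.0 for answer, _ in rollouts]

# Example: plain majority voting would pick "42" (3 vs. 2 votes), but tool
# verification shifts the pseudo-label to the verified answer "41".
rollouts = [("42", False), ("42", False), ("42", False),
            ("41", True), ("41", True)]
print(verification_aware_vote(rollouts))  # -> "41"
print(pseudo_rewards(rollouts))           # -> [0.0, 0.0, 0.0, 1.0, 1.0]
```

The example illustrates the failure mode the paper targets: an unverified majority ("42") would dominate plain majority voting, whereas upweighting verified rollouts lets the tool-confirmed minority answer set the training signal.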