Tool Verification for Test-Time Reinforcement Learning

Abstract

Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards obtained through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased reward signal that is then reinforced, causing the model to collapse onto an incorrect mode. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses evidence from an external tool (e.g., code execution) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
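
The verification-aware voting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weight value of 2.0, and the binary agreement reward are assumptions made for illustration, and the per-rollout verification flags are taken as given inputs rather than produced by an actual tool call.

```python
from collections import defaultdict
from typing import List, Sequence


def verification_aware_vote(
    answers: Sequence[str],
    verified: Sequence[bool],
    verified_weight: float = 2.0,  # hypothetical weight, not from the paper
) -> str:
    """Pick a pseudo-label by weighted voting over rollout answers.

    Each rollout contributes 1.0 to its answer, or `verified_weight`
    if a tool-based check (assumed to have run elsewhere) confirmed it.
    """
    if len(answers) != len(verified):
        raise ValueError("answers and verified must have the same length")
    if not answers:
        raise ValueError("at least one rollout is required")

    scores = defaultdict(float)
    for answer, is_verified in zip(answers, verified):
        scores[answer] += verified_weight if is_verified else 1.0
    # Ties resolve to the answer that appeared first among the rollouts.
    return max(scores, key=scores.get)


def agreement_rewards(answers: Sequence[str], pseudo_label: str) -> List[float]:
    """Assumed reward rule: 1.0 if a rollout matches the pseudo-label, else 0.0."""
    return [1.0 if answer == pseudo_label else 0.0 for answer in answers]


# Three unverified rollouts agree on "12"; two verified rollouts say "15".
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]

label = verification_aware_vote(answers, verified)
print(label)                              # weighted scores: "12" -> 3.0, "15" -> 4.0
print(agreement_rewards(answers, label))
```

With all flags set to False the sketch reduces to plain majority voting and selects "12", which is the unverified-consensus case the abstract warns about; upweighting the two verified rollouts flips the pseudo-label to "15".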
