Tool Verification for Test-Time Reinforcement Learning

Abstract

Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs): it enables online adaptation on unlabeled test inputs by deriving self-induced rewards from majority voting. However, a spurious but high-frequency consensus that goes unverified can become a biased, self-reinforcing reward signal, causing the model to collapse onto an incorrect mode. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses evidence from an external tool (e.g., code execution results) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
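
The voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the boolean `verified` flags (standing in for the outcome of a tool check such as code execution), and the weight of 2.0 for verified rollouts are all assumptions chosen for illustration. The sketch only shows how upweighting verified rollouts can overturn a spurious majority and change the pseudo-label used for rewards.

```python
from collections import defaultdict


def verification_aware_vote(answers, verified, verified_weight=2.0):
    """Pick a pseudo-label by weighted voting over rollout answers.

    answers: final answer extracted from each rollout.
    verified: one bool per rollout, True if a (hypothetical) tool check
        confirmed that rollout's answer.
    verified_weight: assumed weight for verified rollouts; unverified
        rollouts count as 1.0. The value 2.0 is illustrative only.
    """
    scores = defaultdict(float)
    for answer, ok in zip(answers, verified):
        scores[answer] += verified_weight if ok else 1.0
    # Ties resolve to the answer that appeared first in `answers`.
    return max(scores, key=scores.get)


def pseudo_label_rewards(answers, pseudo_label):
    """Binary reward: 1.0 if a rollout agrees with the pseudo-label."""
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


# Three unverified rollouts agree on "12"; two verified rollouts say "15".
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]

label = verification_aware_vote(answers, verified)
rewards = pseudo_label_rewards(answers, label)
print(label, rewards)
```

With all rollouts unverified, the function reduces to plain majority voting and would select "12" here; with the assumed weights, the verified answer "15" wins instead.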
