Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces for a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, in which reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To keep reward estimation efficient, each new trace is not compared against every other trace; instead, it is compared only against a small, dynamically updated pool of previously generated traces that serve as anchors for establishing a relative ranking. We then fit a Bradley-Terry model on the resulting incomplete comparison graph, enabling scalable RL integration without a quadratic number of pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average on competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
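To make the pipeline described above concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (1) detecting a reward group with no variance, (2) comparing each new trace only against a small anchor pool, (3) fitting Bradley-Terry strengths on the resulting sparse comparison list, and (4) standardizing the scores into group-relative advantages. Everything named here is an assumption made for illustration: the `judge` callable is a hypothetical stand-in for the judge system, the pool update rule (keep the `pool_size` most recent traces) is one possible choice since the abstract does not specify it, the fit uses the standard minorization-maximization update with a small `prior` of virtual wins and losses against a dummy item of strength 1 to keep estimates finite, and the mean/std normalization mirrors common group-relative practice rather than a detail stated in the text.

```python
import math
from typing import Callable, List, Tuple

Comparison = Tuple[int, int]  # (winner_index, loser_index)


def is_uninformative(rewards: List[float], tol: float = 1e-8) -> bool:
    """True when all rewards in a group coincide, so every
    group-relative advantage would be zero."""
    return max(rewards) - min(rewards) <= tol


def anchored_comparisons(
    n_traces: int,
    judge: Callable[[int, int], bool],  # hypothetical: True if first trace wins
    pool_size: int = 2,
) -> List[Comparison]:
    """Compare each new trace only against the current anchor pool.
    Assumed pool rule: keep the `pool_size` most recent traces."""
    comparisons: List[Comparison] = []
    pool: List[int] = []
    for new in range(n_traces):
        for anchor in pool:
            comparisons.append((new, anchor) if judge(new, anchor) else (anchor, new))
        pool = (pool + [new])[-pool_size:]
    return comparisons


def fit_bradley_terry(
    n_items: int,
    comparisons: List[Comparison],
    n_iters: int = 200,
    prior: float = 0.1,
) -> List[float]:
    """Return log-strengths from MM updates on a sparse comparison list.
    `prior` adds virtual wins and losses against a dummy item of strength 1."""
    wins = [prior] * n_items
    for winner, _ in comparisons:
        wins[winner] += 1.0
    p = [1.0] * n_items
    for _ in range(n_iters):
        # Contribution of the dummy opponent (2 * prior virtual games).
        denom = [2.0 * prior / (p[i] + 1.0) for i in range(n_items)]
        for winner, loser in comparisons:
            inv = 1.0 / (p[winner] + p[loser])
            denom[winner] += inv
            denom[loser] += inv
        p = [wins[i] / denom[i] for i in range(n_items)]
    return [math.log(x) for x in p]


def scores_to_advantages(scores: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize scores within the group (zero mean, unit scale)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / (std + eps) for s in scores]


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0, 1.0]  # every trace verified correct: no signal
    if is_uninformative(rewards):
        # Toy judge for demonstration only: the lower index always wins.
        comps = anchored_comparisons(len(rewards), lambda a, b: a < b)
        scores = fit_bradley_terry(len(rewards), comps)
        print(scores_to_advantages(scores))
```

With a pool of size k, the number of judge calls grows roughly as n·k instead of n(n-1)/2, which is the reason for fitting a model on an incomplete graph rather than counting wins directly: traces that were never compared head-to-head are still placed on a common scale through their shared anchors.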