Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards are often uninformative at the group level: when all traces sampled for a prompt receive the same reward, group-relative advantage estimation yields no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond checking the final answer, Reasoning Arena constructs trace tournaments in which reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into a rich relative reward signal. To keep reward estimation efficient, each new trace is compared not against every other trace but against a small, dynamically updated pool of previously generated traces that serve as anchors for establishing a relative ranking. We then fit a Bradley-Terry model on the resulting incomplete comparison graph, which allows scalable RL integration without a quadratic number of pairwise comparisons. Empirically, Reasoning Arena consistently outperforms the RLVR baseline, by 7.6% on average, on competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saves nearly 50% of generation compute, and substantially improves overall reasoning performance.
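To make the mechanism concrete, the following is a minimal Python sketch, not the authors' implementation. It shows that standardizing identical verifiable rewards within a group gives all-zero advantages, and that Bradley-Terry strengths fitted on a sparse set of trace-versus-anchor outcomes restore a non-trivial ranking. Several things here are assumptions made for illustration: the function names `group_relative_advantages` and `fit_bradley_terry`, the mean/std standardization used as the group-relative estimator, the minorization-maximization fitting procedure, the `prior` pseudo-count that keeps strengths finite, and the judge outcomes in `comparisons`.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt group (assumed estimator)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Fit Bradley-Terry log-strengths from (winner, loser) index pairs.

    Uses minorization-maximization updates. `prior` is a hypothetical
    regularizer: every item is credited with `prior` wins and `prior`
    losses against a virtual opponent of strength 1, so items that win
    or lose every comparison still get a finite strength.
    """
    wins = [0.0] * n_items
    for winner, _ in comparisons:
        wins[winner] += 1.0

    p = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        # Only observed pairs contribute, so the graph may be incomplete.
        for winner, loser in comparisons:
            d = 1.0 / (p[winner] + p[loser])
            denom[winner] += d
            denom[loser] += d
        p = [
            (wins[i] + prior) / (denom[i] + 2.0 * prior / (p[i] + 1.0))
            for i in range(n_items)
        ]
    return [math.log(x) for x in p]


# Four new traces (indices 0-3) all pass the verifier, so rewards are identical.
rewards = [1.0, 1.0, 1.0, 1.0]
print(group_relative_advantages(rewards))  # all zeros: no gradient signal

# Made-up judge outcomes against two anchors (indices 4 and 5).
comparisons = [
    (0, 4), (0, 5),  # trace 0 beats both anchors
    (1, 4), (5, 1),  # trace 1 beats anchor 4, loses to anchor 5
    (4, 2), (5, 2),  # trace 2 loses to both anchors
    (3, 4), (5, 3),  # trace 3 mirrors trace 1
]
scores = fit_bradley_terry(n_items=6, comparisons=comparisons)

# Re-standardize the fitted strengths of the new traces only.
arena_advantages = group_relative_advantages(scores[:4])
print(arena_advantages)
```

If each of G traces meets each of k anchors once, this scheme needs G·k judge calls rather than G(G−1)/2, which pays off only once G exceeds 2k+1; the toy group above is too small to show that saving and is meant only to illustrate the ranking step.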