Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
June 8, 2026
Authors: Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces for a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, in which reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To keep reward estimation efficient, we avoid exhaustively comparing every pair: each new trace is instead evaluated against a small, dynamically updated pool of previously generated traces that serve as anchors for establishing a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without a quadratic number of pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average on competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saves nearly 50% of generation compute, and substantially improves overall reasoning performance.
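
The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `group_advantages` and `fit_bradley_terry`, the `prior` regulariser, the choice of minorization-maximization (MM) updates for fitting, the use of log-strengths as rewards, and the judge verdicts are all assumptions made for illustration. The first function shows why a group with identical verifiable rewards yields zero advantages; the second fits a Bradley-Terry model to an incomplete set of pairwise outcomes in which each trace is compared only against one anchor.

```python
import math
from collections import defaultdict


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def fit_bradley_terry(n_items, comparisons, prior=0.5, iters=200):
    """Fit Bradley-Terry log-strengths from (winner, loser) index pairs.

    Uses MM updates, so the comparison graph may be incomplete. `prior`
    (assumed regulariser, must be > 0) adds that many virtual wins and
    losses against a fixed strength-1 opponent, which keeps the score of
    a trace with no wins finite.
    """
    wins = [prior] * n_items
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[(min(winner, loser), max(winner, loser))] += 1

    strength = [1.0] * n_items
    for _ in range(iters):
        # Contribution of the 2 * prior virtual games against strength 1.
        denom = [2 * prior / (s + 1.0) for s in strength]
        for (i, j), count in pair_counts.items():
            shared = count / (strength[i] + strength[j])
            denom[i] += shared
            denom[j] += shared
        strength = [w / d for w, d in zip(wins, denom)]
    return [math.log(s) for s in strength]


# Every trace is verified correct, so the outcome reward is identical.
rewards = [1.0, 1.0, 1.0, 1.0]
print(group_advantages(rewards))  # all zeros: no gradient signal

# Hypothetical judge verdicts: traces 1, 2, 3 are compared only with
# anchor trace 0, giving 4 comparisons instead of all 6 pairs.
verdicts = [(1, 0), (0, 2), (3, 0), (3, 0)]
scores = fit_bradley_terry(4, verdicts)
arena_advantages = group_advantages(scores)
print(arena_advantages)
```

In this toy group the fitted scores rank trace 3 above trace 1, the anchor, and trace 2, so normalising them gives non-zero advantages where the outcome reward alone gave none. How the anchor pool is maintained and how the judge is queried are not specified in the abstract and are left out of the sketch.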