Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all traces sampled for a given prompt receive identical rewards, group-relative advantage estimation yields no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such reward-homogeneous groups to a judge system instead of discarding them. Beyond checking the final answer, Reasoning Arena constructs trace tournaments in which reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To keep reward estimation efficient, we avoid exhaustively comparing every pair: each new trace is compared only against a small, dynamically updated pool of previously generated traces that serve as anchors for establishing a relative ranking. We then fit a Bradley-Terry model on the resulting incomplete comparison graph, enabling scalable RL integration without a quadratic number of pairwise comparisons. Empirically, Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average across competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saves nearly 50% of generation compute, and substantially improves overall reasoning performance.
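
To make the mechanism concrete, the following is a minimal sketch of the idea, not the authors' implementation. It shows why group-relative advantages vanish when every trace in a group receives the same reward, and how anchored pairwise comparisons plus a Bradley-Terry fit can recover a non-trivial signal. All function names are hypothetical. The judge is simulated with hidden quality scores, the pool update rule (keep the most recent traces) is an assumption, the small smoothing prior is an assumption added to keep strengths finite, and reusing group normalization on the fitted log-strengths is only one plausible way to turn them into advantages.

```python
import math
from collections import defaultdict


def group_relative_advantages(values, eps=1e-8):
    """Group-normalized advantages: (value - group mean) / (group std + eps)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / (std + eps) for v in values]


def anchored_comparisons(n_traces, judge, pool_size=2):
    """Compare each new trace only against a small pool of earlier traces.

    judge(a, b) returns True if trace a is preferred over trace b.
    Assumption: the pool keeps the `pool_size` most recent traces.
    """
    pool, comparisons = [], []
    for t in range(n_traces):
        for a in pool:
            comparisons.append((t, a) if judge(t, a) else (a, t))
        pool = (pool + [t])[-pool_size:]
    return comparisons  # list of (winner, loser) pairs


def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Fit Bradley-Terry log-strengths on a possibly incomplete comparison graph.

    Uses the classical MM (Zermelo) fixed-point update. Assumption: each
    compared pair receives `prior` virtual wins in both directions so that
    undefeated or winless items keep finite strengths.
    """
    wins = defaultdict(float)  # wins[(i, j)] = number of times i beat j
    for winner, loser in comparisons:
        wins[(winner, loser)] += 1.0
    pairs = {tuple(sorted(key)) for key in wins}
    neighbors = defaultdict(set)
    for i, j in pairs:
        wins[(i, j)] += prior
        wins[(j, i)] += prior
        neighbors[i].add(j)
        neighbors[j].add(i)

    p = [1.0] * n_items
    for _ in range(iters):
        new_p = list(p)
        for i in range(n_items):
            if not neighbors[i]:
                continue  # never compared: leave strength unchanged
            total_wins = sum(wins[(i, j)] for j in neighbors[i])
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
                for j in neighbors[i]
            )
            new_p[i] = total_wins / denom
        # Fix the scale: normalize so the geometric mean of strengths is 1.
        log_mean = sum(math.log(x) for x in new_p) / n_items
        p = [x / math.exp(log_mean) for x in new_p]
    return [math.log(x) for x in p]


# Six traces that all pass the verifier: the outcome reward carries no signal.
rewards = [1.0] * 6
print(group_relative_advantages(rewards))

# Simulated judge: prefers the trace with higher (hidden, made-up) quality.
quality = [0.2, 0.9, 0.5, 0.7, 0.1, 0.6]
comparisons = anchored_comparisons(6, lambda a, b: quality[a] > quality[b])
scores = fit_bradley_terry(6, comparisons)
print(len(comparisons), "comparisons instead of", 6 * 5 // 2)
print(group_relative_advantages(scores))
```

With a pool of size k, the number of judge calls grows roughly as k times the group size rather than quadratically, which is the scaling argument the abstract makes. Because the comparison graph is sparse, the fitted ranking is only an estimate of the full pairwise ordering.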