
ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

January 10, 2026
Authors: Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha
cs.AI

Abstract

Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, so scores within a group are compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves advantage estimation accuracy nearly equivalent to full pairwise comparisons with O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
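
To make the tournament idea concrete, the following is a minimal sketch (not the authors' implementation) of how a seeded single-elimination bracket could turn pairwise judge preferences over N rollout trajectories into group-relative advantages using only N-1 comparisons, versus the O(N^2) comparisons of a full round-robin. The `judge` callable stands in for the paper's process-aware pairwise reward model, and the `seeds`, function names, and rank-to-advantage mapping are illustrative assumptions.

```python
# Sketch of seeded single-elimination ranking for ArenaRL-style advantage estimation.
# Assumptions: `judge(a, b)` returns 0 if trajectory a is preferred, else 1;
# `seeds` are coarse prior scores used only to arrange the initial bracket.
from typing import Callable, List


def seeded_single_elimination_rank(
    trajectories: List[str],
    judge: Callable[[str, str], int],
    seeds: List[float],
) -> List[int]:
    """Return a rank per trajectory (0 = best) using roughly N-1 pairwise judgments."""
    order = sorted(range(len(trajectories)), key=lambda i: -seeds[i])
    # Standard tournament seeding: strongest seed meets weakest seed in round one.
    half = (len(order) + 1) // 2
    top, bottom = order[:half], list(reversed(order[half:]))
    bracket = []
    for i in range(half):
        bracket.append(top[i])
        if i < len(bottom):
            bracket.append(bottom[i])

    round_of_exit = {}   # how many rounds each trajectory survived
    current_round = 0
    while len(bracket) > 1:
        next_round = []
        for a, b in zip(bracket[::2], bracket[1::2]):
            winner = a if judge(trajectories[a], trajectories[b]) == 0 else b
            loser = b if winner == a else a
            round_of_exit[loser] = current_round
            next_round.append(winner)
        if len(bracket) % 2 == 1:
            next_round.append(bracket[-1])  # odd bracket: last slot gets a bye (simplification)
        bracket = next_round
        current_round += 1
    round_of_exit[bracket[0]] = current_round  # champion survives every round

    # Later exit round => better rank; ties within a round broken by seed.
    by_exit = sorted(
        range(len(trajectories)),
        key=lambda i: (-round_of_exit[i], -seeds[i]),
    )
    ranks = [0] * len(trajectories)
    for rank, idx in enumerate(by_exit):
        ranks[idx] = rank
    return ranks


def ranks_to_advantages(ranks: List[int]) -> List[float]:
    """Map ranks to zero-mean scores, analogous to group-normalized advantages."""
    n = len(ranks)
    scores = [(n - 1 - r) / max(n - 1, 1) for r in ranks]  # best -> 1.0, worst -> 0.0
    mean = sum(scores) / n
    return [s - mean for s in scores]


if __name__ == "__main__":
    # Toy example: the "judge" simply prefers the longer plan.
    plans = [f"plan with {k} steps " * k for k in (3, 7, 5, 9, 2, 6, 8, 4)]
    toy_judge = lambda x, y: 0 if len(x) >= len(y) else 1
    seeds = [float(len(p)) for p in plans]
    ranks = seeded_single_elimination_rank(plans, toy_judge, seeds)
    print(ranks, ranks_to_advantages(ranks))
```

For a group of eight rollouts, this sketch issues seven pairwise judgments (4 + 2 + 1) rather than the 28 required by full pairwise comparison, which is the efficiency-versus-precision trade-off the abstract refers to; the resulting zero-mean advantages can then feed a group-relative policy-gradient update in place of pointwise scalar rewards.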