

ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

January 10, 2026
Authors: Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha
cs.AI

Abstract

Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Because these tasks lack objective ground truth, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, so scores within a group are compressed into a narrow range. Consequently, the effective reward signal becomes dominated by reward-model noise, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. In addition, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves advantage estimation accuracy nearly equivalent to that of full O(N^2) pairwise comparison while requiring only O(N) comparisons, striking a favorable balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks with a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
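
The abstract's core efficiency claim is that a seeded single-elimination bracket recovers most of the ranking signal of full pairwise comparison while issuing only N-1 pairwise judge calls instead of N(N-1)/2. The Python sketch below illustrates that idea under stated assumptions: the function names (`seeded_single_elimination`, `rank_based_advantages`), the bye handling, and the rank-to-advantage mapping are illustrative choices, not the paper's exact procedure, and the toy `judge` stands in for an LLM-based pairwise evaluator with multi-level rubrics.

```python
def seeded_single_elimination(trajectories, seeds, judge):
    """Rank a group of trajectories with ~O(N) pairwise comparisons.

    trajectories: candidate rollouts for the same prompt.
    seeds: prior quality estimates (e.g., pointwise RM scores) used only to
           seed the bracket, not as the final reward (hypothetical choice).
    judge: pairwise comparator; judge(a, b) is True if a beats b.
    Returns {trajectory_index: rank}, where rank 0 is the best.
    """
    order = sorted(range(len(trajectories)), key=lambda i: -seeds[i])
    eliminated_round = {}                      # index -> round in which it lost
    bracket, rnd = order, 0
    while len(bracket) > 1:
        nxt = []
        if len(bracket) % 2 == 1:              # odd bracket: top seed gets a bye
            nxt.append(bracket[0])
            bracket = bracket[1:]
        half = len(bracket) // 2
        # Seeded pairing: strongest remaining prior vs. weakest remaining prior.
        for a, b in zip(bracket[:half], reversed(bracket[half:])):
            winner, loser = (a, b) if judge(trajectories[a], trajectories[b]) else (b, a)
            nxt.append(winner)
            eliminated_round[loser] = rnd
        bracket, rnd = nxt, rnd + 1
    eliminated_round[bracket[0]] = rnd         # champion survives the final round
    # Later elimination means a better rank; ties broken by the seeding prior.
    ranking = sorted(range(len(trajectories)),
                     key=lambda i: (-eliminated_round[i], -seeds[i]))
    return {idx: rank for rank, idx in enumerate(ranking)}


def rank_based_advantages(ranks):
    """Map ranks (0 = best) to zero-mean advantages in [-1, 1] for policy updates."""
    n = len(ranks)
    return {i: (n - 1 - 2 * r) / max(n - 1, 1) for i, r in ranks.items()}


if __name__ == "__main__":
    # Toy example: the judge prefers longer plans, standing in for a rubric-based
    # LLM evaluator; the pointwise RM scores are deliberately compressed.
    plans = ["day trip", "3-day plan with hotels", "week-long itinerary", "stub"]
    rm_scores = [0.52, 0.55, 0.56, 0.50]
    judge = lambda a, b: len(a) > len(b)
    ranks = seeded_single_elimination(plans, rm_scores, judge)
    print(rank_based_advantages(ranks))
```

Each match eliminates exactly one candidate, so ranking a group of N rollouts costs N-1 judge calls; the spread of the resulting rank-based advantages does not shrink when the pointwise scores are compressed into a narrow band, which is the failure mode the abstract calls discrimination collapse.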