

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

February 2, 2026
Authors: Jie Xiao, Meng Chen, Qingnan Ren, Jingwei Song, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Ween Yang, Lynn Ai, Eric Yang, Bill Shi, Song Jingwei
cs.AI

Abstract

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.
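The overlap-based capacity model relates training time, dissemination latency, and rollout throughput to a worker-provisioning rule. The paper's exact formulation is not given in the abstract; the sketch below is a simplified illustration under assumed variables: a training step of duration `train_step_time`, a staleness bound of `staleness_bound` steps, per-worker rollout throughput `per_worker_rate`, and a weight-dissemination latency that eats into each worker's effective generation window.

```python
import math

def min_workers(batch_rollouts: int, per_worker_rate: float,
                train_step_time: float, dissemination_latency: float,
                staleness_bound: int) -> int:
    """Estimate the minimum number of rollout workers needed to keep the
    learner fully utilized, under a hypothetical overlap model.

    Assumption: with a staleness bound of `staleness_bound` training steps,
    rollouts for a step may be generated anywhere in a window of
    `staleness_bound * train_step_time` seconds, minus the time lost to
    disseminating fresh policy weights to remote workers.
    """
    effective_window = staleness_bound * train_step_time - dissemination_latency
    if effective_window <= 0:
        raise ValueError("dissemination latency exceeds the staleness budget")
    # Over the window, workers must collectively produce enough rollouts
    # for every training step that falls inside it.
    rollouts_needed = staleness_bound * batch_rollouts
    return math.ceil(rollouts_needed / (per_worker_rate * effective_window))
```

For example, with 512 rollouts per step, 2 rollouts/s per worker, 60 s training steps, 30 s dissemination latency, and a staleness bound of 1 step, the rule calls for 9 workers; relaxing staleness to 2 steps lowers this to 6, illustrating how the user-controlled staleness bound trades policy freshness for cheaper provisioning.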