AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
May 15, 2026
Authors: Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen
cs.AI
Abstract
Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow on math, code, search, and AgentBench workloads and show that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training by 2.7x.
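To make the decoupling described in the abstract concrete, the following is a minimal single-process sketch of three autonomous components that interact only through a shared dataflow buffer. It is an illustration under stated assumptions, not AstraFlow's actual API: the class names `RolloutService`, `DataflowManager`, and `Trainer`, the `max_staleness` filter, and the policy names `planner` and `coder` are all hypothetical and do not come from the paper.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_id: str       # which policy the sample is meant to train (hypothetical field)
    policy_version: int  # weight version used when the sample was generated
    reward: float


class DataflowManager:
    """Hypothetical buffer that sits between rollout generation and training."""

    def __init__(self, max_staleness=1):
        # Assumption: samples older than max_staleness versions are discarded.
        self.max_staleness = max_staleness
        self._buffers = defaultdict(deque)  # one FIFO buffer per policy

    def put(self, traj):
        self._buffers[traj.policy_id].append(traj)

    def take(self, policy_id, current_version, batch_size):
        """Return up to batch_size fresh trajectories, dropping stale ones."""
        buf = self._buffers[policy_id]
        batch = []
        while buf and len(batch) < batch_size:
            traj = buf.popleft()
            if current_version - traj.policy_version <= self.max_staleness:
                batch.append(traj)
        return batch


class RolloutService:
    """Produces trajectories; it never calls into a trainer directly."""

    def __init__(self, manager):
        self.manager = manager

    def generate(self, policy_id, policy_version, rewards):
        # Rewards are passed in here; a real service would run the model and environment.
        for reward in rewards:
            self.manager.put(Trajectory(policy_id, policy_version, reward))


class Trainer:
    """Consumes batches for one policy; it never drives rollout generation."""

    def __init__(self, policy_id, manager):
        self.policy_id = policy_id
        self.manager = manager
        self.version = 0

    def step(self, batch_size):
        batch = self.manager.take(self.policy_id, self.version, batch_size)
        if not batch:
            return None  # nothing to train on yet
        self.version += 1  # stand-in for a gradient update
        return sum(t.reward for t in batch) / len(batch)


# Two policies share one rollout service and one dataflow manager.
manager = DataflowManager(max_staleness=1)
rollouts = RolloutService(manager)
planner = Trainer("planner", manager)
coder = Trainer("coder", manager)

rollouts.generate("planner", 0, [1.0, 0.0])
rollouts.generate("coder", 0, [0.5, 0.5, 1.0])
print(planner.step(batch_size=2))  # mean reward of the planner batch: 0.5
print(coder.step(batch_size=2))    # mean reward of the coder batch: 0.5
```

In this sketch, adding a second policy requires only another `Trainer` instance, and changing the data-selection rule touches only `DataflowManager.take`. That is the kind of separation the abstract attributes to replacing trainer-centered control, though the real system presumably runs these components as distributed services rather than in-process objects.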