AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic Large Language Models

Abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and from the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and to efficiently exploit diverse compute resources. We evaluate AstraFlow on math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves accuracy comparable to or better than existing RL systems while speeding up training by 2.7x.
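
To illustrate what decoupling rollout services, dataflow management, and training could look like, the following is a minimal Python sketch, not AstraFlow's actual implementation or API. All names here (`Sample`, `DataflowBuffer`, `RolloutService`, `Trainer`, `run_demo`, and the policy names "planner" and "executor") are hypothetical and chosen only for illustration. The sketch assumes a simple in-memory buffer keyed by policy; rollout services only write to it and trainers only read from it, so neither side controls the other.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_id: str    # which policy produced this trajectory
    trajectory: str   # placeholder for the actual rollout data
    reward: float


class DataflowBuffer:
    """Hypothetical stand-in for dataflow management: the only shared component."""

    def __init__(self):
        self._queues = {}  # policy_id -> deque of Sample

    def put(self, sample):
        self._queues.setdefault(sample.policy_id, deque()).append(sample)

    def take(self, policy_id, batch_size):
        queue = self._queues.get(policy_id, deque())
        if len(queue) < batch_size:
            return None  # not enough data yet; the caller may retry later
        return [queue.popleft() for _ in range(batch_size)]


class RolloutService:
    """Produces samples and writes them to the buffer; unaware of any trainer."""

    def __init__(self, policy_id, buffer):
        self.policy_id = policy_id
        self.buffer = buffer

    def generate(self, n):
        for i in range(n):
            # Dummy trajectory and reward; a real service would run the policy.
            self.buffer.put(Sample(self.policy_id, f"traj-{i}", float(i % 2)))


class Trainer:
    """Pulls batches from the buffer; unaware of any rollout service."""

    def __init__(self, policy_id, buffer, batch_size):
        self.policy_id = policy_id
        self.buffer = buffer
        self.batch_size = batch_size
        self.steps = 0

    def step(self):
        batch = self.buffer.take(self.policy_id, self.batch_size)
        if batch is None:
            return False  # no full batch available
        self.steps += 1   # a real trainer would compute a policy update here
        return True


def run_demo():
    buffer = DataflowBuffer()
    rollouts = [RolloutService("planner", buffer), RolloutService("executor", buffer)]
    trainers = [Trainer("planner", buffer, 4), Trainer("executor", buffer, 4)]
    for service in rollouts:
        service.generate(8)
    for trainer in trainers:
        while trainer.step():
            pass
    return {trainer.policy_id: trainer.steps for trainer in trainers}
```

In this sketch, adding a second policy only means registering another rollout service and trainer against the same buffer; no component's code changes. This mirrors, at toy scale, the claim that multi-policy training needs no system-level code changes, under the stated assumptions.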