
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

May 15, 2026
Authors: Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen
cs.AI

Abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves accuracy comparable to or better than existing RL systems while speeding up training by 2.7x.
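
To make the decoupling described in the abstract more concrete, here is a minimal single-process sketch of rollout generation, dataflow management, and training as separate components that interact only through a shared buffer. It is an illustration under stated assumptions, not AstraFlow's actual design or API: the names `Trajectory`, `DataflowBuffer`, `RolloutService`, `Trainer`, and the `max_staleness` parameter are all hypothetical, and the "training step" is a placeholder that only bumps a version counter.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_id: str       # which policy produced this rollout (hypothetical field)
    policy_version: int  # weight version the rollout was sampled with
    reward: float


class DataflowBuffer:
    """Hypothetical dataflow layer: the only object rollouts and trainers share."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self._queues = {}  # policy_id -> deque of Trajectory

    def put(self, traj):
        self._queues.setdefault(traj.policy_id, deque()).append(traj)

    def get_batch(self, policy_id, current_version, batch_size):
        queue = self._queues.get(policy_id, deque())
        # Drop trajectories sampled from weights that are too old.
        fresh = deque(
            t for t in queue
            if current_version - t.policy_version <= self.max_staleness
        )
        self._queues[policy_id] = fresh
        if len(fresh) < batch_size:
            return None  # not enough data yet; the trainer simply retries later
        return [fresh.popleft() for _ in range(batch_size)]


class RolloutService:
    """Produces trajectories; knows nothing about any trainer."""

    def __init__(self, buffer):
        self.buffer = buffer

    def generate(self, policy_id, version, n):
        for i in range(n):
            # Assumption: a dummy alternating reward stands in for a real rollout.
            self.buffer.put(Trajectory(policy_id, version, float(i % 2)))


class Trainer:
    """Consumes batches for one policy; never calls the rollout service."""

    def __init__(self, policy_id, buffer, batch_size):
        self.policy_id = policy_id
        self.buffer = buffer
        self.batch_size = batch_size
        self.version = 0

    def step(self):
        batch = self.buffer.get_batch(self.policy_id, self.version, self.batch_size)
        if batch is None:
            return False
        self.version += 1  # placeholder for a real gradient update
        return True
```

In this sketch, adding a second policy means creating another `Trainer` with a different `policy_id`; neither the rollout service nor the buffer changes. A real system would run these components concurrently and across machines, which this sketch does not attempt.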