AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while making efficient use of elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension typically requires dedicated system engineering. This burden stems from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, which lets the system natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow on math, code, search, and AgentBench workloads and show that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves accuracy comparable to or better than existing RL systems while speeding up training by 2.7x.
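
As a rough illustration of the decoupling described in the abstract, the following Python sketch runs rollout workers, a dataflow buffer, and per-policy trainers as independent components that communicate only through queues. It is a toy under stated assumptions, not AstraFlow's implementation: the class names `DataflowBuffer`, `RolloutService`, and `Trainer`, the policy names `planner` and `executor`, the random rewards, and the version counter standing in for a weight update are all hypothetical.

```python
import queue
import random
import threading
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_id: str       # which policy produced this rollout
    policy_version: int  # version of the weights used to generate it
    reward: float        # placeholder scalar reward


class DataflowBuffer:
    """Hypothetical dataflow manager: routes trajectories to per-policy queues."""

    def __init__(self, policy_ids):
        self._queues = {pid: queue.Queue() for pid in policy_ids}

    def put(self, traj):
        self._queues[traj.policy_id].put(traj)

    def get_batch(self, policy_id, batch_size):
        q = self._queues[policy_id]
        # Blocks until batch_size trajectories for this policy are available.
        return [q.get() for _ in range(batch_size)]


class RolloutService(threading.Thread):
    """Hypothetical rollout worker that never calls into a trainer."""

    def __init__(self, policy_id, buffer, versions, num_rollouts, seed):
        super().__init__()
        self.policy_id = policy_id
        self.buffer = buffer
        self.versions = versions
        self.num_rollouts = num_rollouts
        self.rng = random.Random(seed)

    def run(self):
        for _ in range(self.num_rollouts):
            version = self.versions[self.policy_id]  # latest published version
            reward = self.rng.random()               # stand-in for a real rollout
            self.buffer.put(Trajectory(self.policy_id, version, reward))


class Trainer(threading.Thread):
    """Hypothetical trainer for one policy; it only consumes from the buffer."""

    def __init__(self, policy_id, buffer, versions, num_steps, batch_size):
        super().__init__()
        self.policy_id = policy_id
        self.buffer = buffer
        self.versions = versions
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.mean_rewards = []

    def run(self):
        for _ in range(self.num_steps):
            batch = self.buffer.get_batch(self.policy_id, self.batch_size)
            self.mean_rewards.append(sum(t.reward for t in batch) / len(batch))
            self.versions[self.policy_id] += 1  # stand-in for a weight update


def run_demo(policy_ids=("planner", "executor"), workers_per_policy=2,
             num_steps=3, batch_size=4):
    versions = {pid: 0 for pid in policy_ids}
    buffer = DataflowBuffer(policy_ids)
    total = num_steps * batch_size  # trajectories each trainer will consume
    trainers = {pid: Trainer(pid, buffer, versions, num_steps, batch_size)
                for pid in policy_ids}
    rollouts = []
    for p, pid in enumerate(policy_ids):
        for w in range(workers_per_policy):
            # Split the workload evenly; early workers absorb any remainder.
            share = total // workers_per_policy
            share += 1 if w < total % workers_per_policy else 0
            rollouts.append(
                RolloutService(pid, buffer, versions, share, seed=100 * p + w))
    threads = list(trainers.values()) + rollouts
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {"versions": versions,
            "mean_rewards": {pid: tr.mean_rewards
                             for pid, tr in trainers.items()}}


if __name__ == "__main__":
    print(run_demo())
```

In this toy, adding a policy or changing the number of rollout workers only changes the arguments to `run_demo`; no component holds a reference to another component's control loop. This is meant to convey the flavor of the abstraction, not its actual interface.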