RLinf-USER: A Unified and Extensible System for Real-World Online Policy Learning in Embodied AI
February 8, 2026
Authors: Hongzhi Zang, Shu'ang Yu, Hao Lin, Tianxing Zhou, Zefang Huang, Zhen Guo, Xin Xu, Jiakai Zhou, Yuze Sheng, Shizhe Zhang, Feng Gao, Wenhao Tang, Yufeng Yue, Quanlu Zhang, Xinlei Chen, Chao Yu, Yu Wang
cs.AI
Abstract
Online policy learning directly in the physical world is a promising yet challenging direction for embodied intelligence. Unlike simulation, real-world systems cannot be arbitrarily accelerated, cheaply reset, or massively replicated, which makes scalable data collection, heterogeneous deployment, and effective long-horizon training difficult. These challenges suggest that real-world policy learning is not only an algorithmic issue but fundamentally a systems problem. We present USER, a Unified and extensible SystEm for Real-world online policy learning. USER treats physical robots as first-class hardware resources alongside GPUs through a unified hardware abstraction layer, enabling automatic discovery, management, and scheduling of heterogeneous robots. To address cloud-edge communication, USER introduces an adaptive communication plane with tunneling-based networking, distributed data channels for traffic localization, and streaming-multiprocessor-aware weight synchronization to regulate GPU-side overhead. On top of this infrastructure, USER organizes learning as a fully asynchronous framework with a persistent, cache-aware buffer, enabling efficient long-horizon experiments with robust crash recovery and reuse of historical data. In addition, USER provides extensible abstractions for rewards, algorithms, and policies, supporting online imitation or reinforcement learning of CNN/MLP policies, generative policies, and large vision-language-action (VLA) models within a unified pipeline. Results in both simulation and the real world show that USER enables multi-robot coordination, collaboration among heterogeneous manipulators, edge-cloud collaboration with large models, and long-running asynchronous training, offering a unified and extensible systems foundation for real-world online policy learning.
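To make the "robots as first-class hardware resources" idea concrete, the following is a minimal sketch of a registry that discovers and schedules robots and GPUs under one namespace. All names here (`HardwareResource`, `ResourceRegistry`, `discover`, `schedule`) are hypothetical illustrations of the abstraction described above, not the actual RLinf-USER API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HardwareResource:
    """A schedulable device: a GPU or a physical robot, treated uniformly."""
    name: str
    kind: str                 # e.g. "gpu" or "robot"
    capabilities: dict = field(default_factory=dict)
    busy: bool = False

class ResourceRegistry:
    """Toy hardware abstraction layer: discovery, management, scheduling."""

    def __init__(self) -> None:
        self._resources: dict = {}

    def discover(self, resource: HardwareResource) -> None:
        """Register a newly discovered device under the unified namespace."""
        self._resources[resource.name] = resource

    def schedule(self, kind: str) -> Optional[HardwareResource]:
        """Claim the first idle resource of the requested kind, if any."""
        for r in self._resources.values():
            if r.kind == kind and not r.busy:
                r.busy = True
                return r
        return None

    def release(self, name: str) -> None:
        """Return a claimed resource to the idle pool."""
        self._resources[name].busy = False

registry = ResourceRegistry()
registry.discover(HardwareResource("gpu-0", "gpu", {"memory_gb": 80}))
registry.discover(HardwareResource("arm-left", "robot", {"dof": 7}))
registry.discover(HardwareResource("arm-right", "robot", {"dof": 6}))

claimed = registry.schedule("robot")
print(claimed.name)  # first idle robot, "arm-left"
```

The point of the sketch is the uniform interface: a scheduler that allocates a heterogeneous manipulator uses the same call path as one that allocates a GPU, which is what lets a single pipeline coordinate both.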