RLinf-USER: A Unified and Extensible System for Real-World Online Policy Learning in Embodied AI
February 8, 2026
Authors: Hongzhi Zang, Shu'ang Yu, Hao Lin, Tianxing Zhou, Zefang Huang, Zhen Guo, Xin Xu, Jiakai Zhou, Yuze Sheng, Shizhe Zhang, Feng Gao, Wenhao Tang, Yufeng Yue, Quanlu Zhang, Xinlei Chen, Chao Yu, Yu Wang
cs.AI
Abstract
Online policy learning directly in the physical world is a promising yet challenging direction for embodied intelligence. Unlike simulation, real-world systems cannot be arbitrarily accelerated, cheaply reset, or massively replicated, which makes scalable data collection, heterogeneous deployment, and long-horizon effective training difficult. These challenges suggest that real-world policy learning is not only an algorithmic issue but fundamentally a systems problem. We present USER, a Unified and extensible SystEm for Real-world online policy learning. USER treats physical robots as first-class hardware resources alongside GPUs through a unified hardware abstraction layer, enabling automatic discovery, management, and scheduling of heterogeneous robots. To address cloud-edge communication, USER introduces an adaptive communication plane with tunneling-based networking, distributed data channels for traffic localization, and streaming-multiprocessor-aware weight synchronization to regulate GPU-side overhead. On top of this infrastructure, USER organizes learning as a fully asynchronous framework with a persistent, cache-aware buffer, enabling efficient long-horizon experiments with robust crash recovery and reuse of historical data. In addition, USER provides extensible abstractions for rewards, algorithms, and policies, supporting online imitation or reinforcement learning of CNN/MLP policies, generative policies, and large vision-language-action (VLA) models within a unified pipeline. Results in both simulation and the real world show that USER enables multi-robot coordination, control of heterogeneous manipulators, edge-cloud collaboration between large models, and long-running asynchronous training, offering a unified and extensible systems foundation for real-world online policy learning.
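To make the "robots as first-class hardware resources" idea concrete, the following is a minimal, purely illustrative sketch of a hardware abstraction layer in which heterogeneous robots and GPUs are registered in one pool and scheduled uniformly. All names here (`Resource`, `ResourcePool`, `acquire`, the tag sets) are hypothetical and are not taken from USER's actual API; real discovery, health checking, and scheduling policies would be far richer.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A schedulable hardware unit; robots and GPUs share this interface."""
    name: str
    kind: str                               # e.g. "robot" or "gpu"
    tags: set = field(default_factory=set)  # capabilities, e.g. {"7dof"}
    busy: bool = False


class ResourcePool:
    """Tracks heterogeneous resources and hands out idle ones on request."""

    def __init__(self):
        self._resources = []

    def register(self, res):
        # Automatic discovery is reduced to explicit registration here.
        self._resources.append(res)

    def acquire(self, kind, tags=frozenset()):
        # Schedule the first idle resource of the right kind that carries
        # all requested capability tags; return None if none is free.
        for r in self._resources:
            if not r.busy and r.kind == kind and set(tags) <= r.tags:
                r.busy = True
                return r
        return None

    def release(self, res):
        res.busy = False


pool = ResourcePool()
pool.register(Resource("arm-0", "robot", {"7dof"}))
pool.register(Resource("gpu-0", "gpu", {"a100"}))
arm = pool.acquire("robot", {"7dof"})  # a robot is scheduled like a GPU
```

The point of the sketch is only the uniformity: because robots and GPUs satisfy the same resource interface, one scheduler can allocate both, which is what lets a system co-locate data collection and training jobs over mixed hardware.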
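The persistent, crash-recoverable buffer can likewise be illustrated with a toy sketch. This is an assumption-laden simplification, not USER's implementation: transitions are pickled to disk on every write, and a "restart" simply reconstructs the buffer from the same file, recovering historical data. A production system would batch writes, checkpoint, and manage caches rather than rewrite the whole file per transition.

```python
import os
import pickle
import random
import tempfile


class PersistentBuffer:
    """Replay buffer persisted to disk so a crashed run can resume."""

    def __init__(self, path, capacity=10_000):
        self.path = path
        self.capacity = capacity
        # Crash recovery: reload transitions left behind by a previous run.
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.data = pickle.load(f)
        else:
            self.data = []

    def add(self, transition):
        self.data.append(transition)
        self.data = self.data[-self.capacity:]  # drop oldest past capacity
        # Persist eagerly; a real system would batch or checkpoint instead.
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f)

    def sample(self, k):
        # Learner-side sampling is decoupled from actor-side writes.
        return random.sample(self.data, min(k, len(self.data)))


# Demo: an actor writes a transition, then a "restarted" process recovers it.
path = os.path.join(tempfile.mkdtemp(), "buffer.pkl")
buf = PersistentBuffer(path)
buf.add(("obs", "act", 1.0))
recovered = PersistentBuffer(path)  # simulates a restart after a crash
```

Decoupling writers (robots collecting data) from readers (the learner sampling batches) through such a durable buffer is what allows fully asynchronous, long-running training to survive failures on either side and to reuse data across experiments.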