RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Abstract

Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. As a result, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate the framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and π_{0.5}, and observe consistent improvements over both real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on π_{0.5}. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially better real-world data efficiency, providing a practical and scalable pathway for using simulation to improve real-robot deployment.
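
The stage-two objective described in the abstract (an RL loss on simulated rollouts plus an auxiliary supervised loss on real demonstrations) can be illustrated with a minimal sketch. This is not the authors' implementation: the discrete-action policy, the plain policy-gradient surrogate, the function names, and the weight `beta` are all assumptions chosen for illustration, and the abstract does not specify which RL algorithm or loss weighting is used.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_loss(logits_batch, demo_actions):
    """Mean negative log-likelihood of demonstrated (discrete) actions."""
    nll = [-log_softmax(l)[a] for l, a in zip(logits_batch, demo_actions)]
    return sum(nll) / len(nll)


def rl_surrogate_loss(logits_batch, sampled_actions, advantages):
    """Generic policy-gradient surrogate: -mean(advantage * log pi(a|s)).

    Assumption: stands in for whatever RL objective the framework uses.
    """
    terms = [
        -adv * log_softmax(l)[a]
        for l, a, adv in zip(logits_batch, sampled_actions, advantages)
    ]
    return sum(terms) / len(terms)


def stage2_loss(sim_batch, real_batch, beta=1.0):
    """RL loss on simulated rollouts plus a weighted supervised anchor
    on real demonstrations. `beta` is a hypothetical weighting factor."""
    rl = rl_surrogate_loss(*sim_batch)
    anchor = sft_loss(*real_batch)
    return rl + beta * anchor


# Toy usage: one simulated transition and one real demonstration step.
sim_batch = ([[0.0, 0.0]], [0], [2.0])   # logits, sampled actions, advantages
real_batch = ([[0.0, 0.0]], [1])         # logits, demonstrated actions
total = stage2_loss(sim_batch, real_batch, beta=0.5)
```

In this sketch the supervised term keeps the likelihood of real-world demonstrations high while the RL term is optimized in simulation, which is the anchoring role the abstract attributes to the auxiliary loss. Stage one would correspond to minimizing `sft_loss` alone on a mixture of real and simulated demonstrations.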
