π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

October 29, 2025
Authors: Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, Chao Yu
cs.AI

Abstract

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., π₀, π₀.₅) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with π_RL, an open-source framework for training flow-based VLAs in parallel simulation. π_RL implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate π_RL on the LIBERO and ManiSkill benchmarks. On LIBERO, π_RL boosts few-shot SFT models π₀ and π₀.₅ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train π_RL in 320 parallel environments, improving π₀ from 41.6% to 85.7% and π₀.₅ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, π_RL achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
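
The central obstacle the abstract names is that iterative denoising makes a flow policy's action log-likelihood intractable; both Flow-Noise and Flow-SDE recover a tractable likelihood by treating each denoising step as a Gaussian transition in a discrete-time MDP. Below is a minimal PyTorch sketch of that idea only: a stochastic Euler denoising rollout whose per-step Gaussian log-probs sum to the log π(a|o) that a policy-gradient method needs. The `velocity_net` interface, the fixed noise scale `sigma`, and the uniform Euler schedule are illustrative assumptions, not π_RL's actual implementation, which learns the injected noise (Flow-Noise) or derives it from an ODE-to-SDE conversion (Flow-SDE).

```python
import torch

def stochastic_denoise_with_logprob(velocity_net, obs_emb,
                                    num_steps=10, sigma=0.1,
                                    horizon=8, action_dim=7):
    """Sample an action chunk by stochastic denoising and return its log-likelihood.

    Each step perturbs the deterministic Euler update with Gaussian noise, so the
    denoising chain becomes a discrete-time MDP with Gaussian transitions and the
    chunk's log-likelihood is simply the sum of per-step Gaussian log-probs.
    """
    batch = obs_emb.shape[0]
    x = torch.randn(batch, horizon, action_dim)        # start from pure noise
    dt = 1.0 / num_steps
    total_logprob = torch.zeros(batch)
    for k in range(num_steps):
        t = torch.full((batch,), k * dt)
        v = velocity_net(x, t, obs_emb)                # predicted flow velocity
        mean = x + v * dt                              # deterministic Euler update
        x_next = mean + sigma * torch.randn_like(x)    # inject exploration noise
        step_logprob = torch.distributions.Normal(mean, sigma).log_prob(x_next)
        total_logprob = total_logprob + step_logprob.sum(dim=(1, 2))
        x = x_next
    return x, total_logprob                            # action chunk, log pi(a | o)

# Usage with a toy velocity field standing in for the VLA's flow action head
# (hypothetical placeholder, not the paper's model).
velocity_net = lambda x, t, obs: -x
actions, logp = stochastic_denoise_with_logprob(velocity_net, torch.zeros(4, 32))
print(actions.shape, logp.shape)                       # torch.Size([4, 8, 7]) torch.Size([4])
```

In π_RL proper, this kind of log-probability would feed a policy-gradient objective; the Flow-SDE variant additionally folds the denoising index into a two-layer MDP so that environment steps and denoising steps are handled jointly during RL exploration.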