
π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

October 29, 2025
Authors: Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, Chao Yu
cs.AI

Abstract

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., π₀, π₀.₅) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with π_RL, an open-source framework for training flow-based VLAs in parallel simulation. π_RL implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate π_RL on the LIBERO and ManiSkill benchmarks. On LIBERO, π_RL boosts the few-shot SFT models π₀ and π₀.₅ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train π_RL in 320 parallel environments, improving π₀ from 41.6% to 85.7% and π₀.₅ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, π_RL achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
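
For intuition, the sketch below illustrates the core idea behind Flow-Noise as described in the abstract: if each of the K denoising steps of a flow-matching action head injects learnable Gaussian noise, every step becomes a tractable Gaussian transition, so the log-likelihood of the sampled action chunk can be computed exactly and summed for a policy-gradient objective. This is a minimal, hypothetical illustration rather than the authors' implementation; `velocity_net` and `noise_std` are assumed stand-ins for the VLA's flow head and the learnable noise network.

```python
import torch

# Minimal sketch (not the authors' code): treating the K denoising steps of a
# flow-matching action head as a discrete-time MDP. Injecting learnable
# Gaussian noise at each Euler step makes every transition a Gaussian whose
# log-likelihood can be computed exactly and summed across steps.
# `velocity_net` and `noise_std` are hypothetical stand-ins for the VLA's
# flow head and the learnable noise network mentioned in the abstract.

def denoise_with_logprob(velocity_net, noise_std, obs, num_steps=10, action_dim=7):
    """Euler denoising with injected Gaussian noise; returns actions and their exact log-prob."""
    batch = obs.shape[0]
    dt = 1.0 / num_steps
    x = torch.randn(batch, action_dim)          # start the action chunk from pure noise
    log_prob = torch.zeros(batch)
    for k in range(num_steps):
        t = torch.full((batch, 1), k * dt)
        v = velocity_net(obs, x, t)             # predicted flow-matching velocity
        mean = x + v * dt                       # deterministic Euler update
        std = noise_std(t)                      # learnable per-step noise scale
        step_dist = torch.distributions.Normal(mean, std)
        x = step_dist.sample()                  # stochastic denoising transition
        log_prob = log_prob + step_dist.log_prob(x).sum(dim=-1)  # exact Gaussian log-likelihood
    return x, log_prob
```

The Flow-SDE variant described in the abstract instead obtains the stochasticity needed for exploration by converting the deterministic probability-flow ODE into an SDE with matching marginals, treating the denoising chain and the environment rollout as a two-layer MDP.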