

VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

October 1, 2025
Authors: Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su
cs.AI

Abstract

Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
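The abstract describes the core loop of world-model-based reinforcement fine-tuning: roll the policy out inside a learned simulator, score the rollout with dense, trajectory-level rewards derived from goal-achieving references, and update the policy. The sketch below illustrates that general idea only; the class names (`WorldModel`, `VLAPolicy`), the squared-distance-to-goal reward, and the REINFORCE-style update are illustrative assumptions, not the paper's actual architecture, verified-reward design, or RL algorithm.

```python
# Minimal sketch of world-model-based reinforcement fine-tuning (RFT).
# All names and design choices here are assumptions for illustration;
# the paper's world model, reward verification, and optimizer differ.

import torch
import torch.nn as nn


class WorldModel(nn.Module):
    """Toy stand-in for a data-driven world model that predicts the next
    (latent) observation conditioned on the current observation and action."""
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


class VLAPolicy(nn.Module):
    """Toy stand-in for a VLA policy head producing a Gaussian over actions."""
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())


def trajectory_reward(pred_obs, goal_obs):
    """Dense reward per step: negative distance of the predicted observation
    to a goal-achieving reference (an assumed proxy for the paper's
    goal-achievement-based verified reward)."""
    return -(pred_obs - goal_obs).pow(2).mean(dim=-1)


def rft_step(policy, world_model, obs, goal_obs, optimizer, horizon=8):
    """Roll the policy out inside the (frozen) world model and apply a
    simple REINFORCE-style update on the summed trajectory-level rewards."""
    log_probs, rewards = [], []
    for _ in range(horizon):
        dist = policy.dist(obs)
        act = dist.sample()
        log_probs.append(dist.log_prob(act).sum(-1))
        with torch.no_grad():  # the simulator stays fixed during RFT
            obs = world_model(obs, act)
        rewards.append(trajectory_reward(obs, goal_obs))
    returns = torch.stack(rewards).sum(0)                    # episode return per sample
    loss = -(torch.stack(log_probs).sum(0) * returns.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    wm, pi = WorldModel(), VLAPolicy()
    opt = torch.optim.Adam(pi.parameters(), lr=3e-4)
    obs, goal = torch.randn(16, 32), torch.randn(16, 32)     # dummy batch
    for step in range(5):
        print(step, rft_step(pi, wm, obs, goal, opt))
```

Because every rollout happens inside the learned simulator rather than on a robot or in a hand-built physics engine, each gradient step is cheap and action-aligned, which is consistent with the abstract's claim that fewer than 400 fine-tuning steps suffice to surpass strong supervised baselines.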