VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
October 1, 2025
Authors: Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su
cs.AI
Abstract
Vision-Language-Action (VLA) models enable embodied decision-making but rely
heavily on imitation learning, leading to compounding errors and poor
robustness under distribution shift. Reinforcement learning (RL) can mitigate
these issues yet typically demands costly real-world interactions or suffers
from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning
framework that leverages a data-driven world model as a controllable simulator.
Trained from real interaction data, the simulator predicts future visual
observations conditioned on actions, allowing policy rollouts with dense,
trajectory-level rewards derived from goal-achieving references. This design
delivers an efficient and action-aligned learning signal, drastically lowering
sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses
strong supervised baselines and achieves greater efficiency than
simulator-based RL. Moreover, it exhibits strong robustness under perturbed
conditions, sustaining stable task execution. Our results establish
world-model-based RFT as a practical post-training paradigm to enhance the
generalization and robustness of VLA models. For more details, please refer to
https://vla-rft.github.io/.
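To make the described pipeline concrete, below is a minimal PyTorch sketch of the core idea: rolling out a policy inside a learned world model and fine-tuning it with a dense, trajectory-level reward computed against a goal-achieving reference. The module names (`WorldModel`, `VLAPolicy`), the negative-distance reward, and the REINFORCE-style update are illustrative assumptions only; the abstract does not specify the paper's actual architectures, reward formulation, or optimization algorithm.

```python
# Minimal sketch of world-model-based reinforcement fine-tuning (RFT).
# All shapes, modules, and the reward definition are illustrative assumptions.
import torch
import torch.nn as nn


class WorldModel(nn.Module):
    """Data-driven simulator: predicts the next (latent) observation given the
    current observation and an action. Assumed pretrained on real interaction
    data and kept frozen during RFT."""

    def __init__(self, obs_dim: int = 64, act_dim: int = 8):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim)
        )

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.dynamics(torch.cat([obs, action], dim=-1))


class VLAPolicy(nn.Module):
    """Stand-in for a VLA policy head: maps an observation to a Gaussian over
    continuous actions (language conditioning omitted for brevity)."""

    def __init__(self, obs_dim: int = 64, act_dim: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mu(self.backbone(obs)),
                                          self.log_std.exp())


def trajectory_reward(pred_obs: torch.Tensor, ref_obs: torch.Tensor) -> torch.Tensor:
    """Dense reward: negative squared distance between the imagined rollout and
    a goal-achieving reference trajectory (assumed form, one reward per step)."""
    return -(pred_obs - ref_obs).pow(2).mean(dim=-1)


def rft_step(policy, world_model, optimizer, start_obs, ref_traj, horizon=8):
    """One REINFORCE-style fine-tuning step performed entirely inside the
    learned simulator, i.e. without real-world interaction."""
    obs = start_obs
    log_probs, rewards = [], []
    for t in range(horizon):
        dist = policy.dist(obs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum(-1))
        with torch.no_grad():                      # simulator stays frozen
            obs = world_model(obs, action)
        rewards.append(trajectory_reward(obs, ref_traj[t]))

    returns = torch.stack(rewards).flip(0).cumsum(0).flip(0)    # reward-to-go
    loss = -(torch.stack(log_probs) * returns.detach()).mean()  # policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    policy, wm = VLAPolicy(), WorldModel()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
    start = torch.randn(16, 64)            # batch of starting observations
    reference = torch.randn(8, 16, 64)     # goal-achieving reference rollout
    for step in range(10):                 # the paper reports < 400 RFT steps
        print(rft_step(policy, wm, opt, start, reference))
```

Because every rollout happens in the learned world model, the policy receives an action-aligned learning signal at each step without costly real-world interaction, which is what allows the reported sample efficiency.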