VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

Abstract

Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, which leads to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues, yet it typically demands costly real-world interaction or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that uses a data-driven world model as a controllable simulator. Trained on real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient, action-aligned learning signal and drastically lowers sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and is more efficient than simulator-based RL. It also remains robust under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm for improving the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
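The rollout-and-reward idea summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the additive dynamics in toy_world_model, the low-dimensional vector observations, and the L1-based reward in trajectory_reward are all assumptions chosen to keep the example small. A real system would use a learned model that predicts images and whatever reward the paper actually defines.

```python
from typing import Callable, List, Sequence

Vector = List[float]
WorldModel = Callable[[Vector, Vector], Vector]


def toy_world_model(obs: Vector, action: Vector) -> Vector:
    """Hypothetical stand-in for a learned world model.

    Assumed dynamics: next observation = observation + action. A real
    world model would be trained on interaction data and predict images.
    """
    return [o + a for o, a in zip(obs, action)]


def rollout(world_model: WorldModel, first_obs: Vector,
            actions: Sequence[Vector]) -> List[Vector]:
    """Feed an action sequence through the world model step by step."""
    obs = list(first_obs)
    predicted: List[Vector] = []
    for action in actions:
        obs = world_model(obs, action)
        predicted.append(obs)
    return predicted


def trajectory_reward(predicted: Sequence[Vector],
                      reference: Sequence[Vector]) -> float:
    """Assumed dense reward: negative mean per-step L1 distance.

    Each step contributes its own error term, so the signal is dense
    over the whole trajectory rather than a single success flag.
    """
    per_step = [
        sum(abs(p - r) for p, r in zip(p_obs, r_obs)) / len(p_obs)
        for p_obs, r_obs in zip(predicted, reference)
    ]
    return -sum(per_step) / len(per_step)


if __name__ == "__main__":
    start = [0.0, 0.0]
    # Reference actions that reach the goal; the reference trajectory is
    # what the world model predicts for them.
    reference_actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    reference = rollout(toy_world_model, start, reference_actions)

    # A policy sample whose first action deviates from the reference.
    policy_actions = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    predicted = rollout(toy_world_model, start, policy_actions)

    print("reward:", trajectory_reward(predicted, reference))
```

In this sketch, an action sequence that reproduces the reference trajectory gets the highest possible reward (zero), and any deviation is penalized at every step it affects. A policy-gradient update would then favor action sequences with higher reward; that update step is omitted here.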
