Do World Action Models Generalize Better than VLAs? A Robustness Study
April 1, 2026
Authors: Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xinyu Wang, Xingyue Quan, Yingxue Zhang
cs.AI
Abstract
Robot action planning in the real world is challenging, as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation via action-expert modules, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models trained on large corpora of video data to predict future states. With minor adaptations, their latent representations can be decoded into robot actions. It has been suggested that their explicit dynamics-prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state-of-the-art VLA policies and recently released WAMs, evaluating their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot-VA reaching a 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus. While VLAs such as π_{0.5} can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamic learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.
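To make the architectural contrast concrete, the following is a minimal, hypothetical PyTorch sketch. It does not reproduce any model evaluated in the paper (π_{0.5}, LingBot-VA, Cosmos-Policy, or others); every module, dimension, and name is an illustrative assumption. The point is structural: a VLA-style policy routes a fused vision-language representation through an action-expert head, whereas a WAM-style policy first rolls a video-pretrained dynamics model forward over frame latents and then decodes the predicted future state into an action, which is the "minor adaptation" the abstract refers to.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins only; real systems use large pretrained backbones.

class VLAPolicy(nn.Module):
    """VLA-style: a vision-language backbone feeds an action-expert head."""
    def __init__(self, vlm_dim: int = 768, action_dim: int = 7):
        super().__init__()
        self.vlm = nn.Linear(vlm_dim, vlm_dim)  # stand-in for a pretrained VLM
        self.action_expert = nn.Sequential(     # stand-in for an action-expert module
            nn.Linear(vlm_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs_lang_embedding: torch.Tensor) -> torch.Tensor:
        # Map the fused observation+instruction embedding directly to an action.
        return self.action_expert(self.vlm(obs_lang_embedding))

class WAMPolicy(nn.Module):
    """WAM-style: a video-pretrained world model predicts future latents;
    a small decoder maps the predicted latent state to an action."""
    def __init__(self, latent_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.world_model = nn.GRU(latent_dim, latent_dim, batch_first=True)  # stand-in dynamics model
        self.action_decoder = nn.Linear(latent_dim, action_dim)              # the "minor adaptation"

    def forward(self, obs_latents: torch.Tensor) -> torch.Tensor:
        # obs_latents: (batch, time, latent_dim) sequence of frame latents.
        future_latents, _ = self.world_model(obs_latents)   # predict how the scene evolves
        return self.action_decoder(future_latents[:, -1])   # decode predicted state into an action

if __name__ == "__main__":
    obs = torch.randn(2, 768)        # dummy fused vision+language embedding
    print(VLAPolicy()(obs).shape)    # torch.Size([2, 7])
    video = torch.randn(2, 16, 512)  # dummy sequence of 16 frame latents
    print(WAMPolicy()(video).shape)  # torch.Size([2, 7])
```

Under this framing, the robustness question the paper studies is whether the explicit dynamics stage (the world model above) confers better tolerance to visual and language perturbations than mapping observations to actions directly.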