villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
July 31, 2025
Authors: Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian
cs.AI
Abstract
Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of the visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments, including SIMPLER and LIBERO, as well as on two real-world robot setups covering gripper and dexterous-hand manipulation. We believe the ViLLA paradigm holds significant promise, and that villa-X provides a strong foundation for future research.
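
To make the latent-action idea concrete, below is a minimal sketch of an inverse-dynamics-style encoder that maps a pair of consecutive frames to a compact latent action vector, i.e., an abstract representation of the visual change between the two frames. This is an illustrative assumption, not the architecture used in villa-X: the module names, layer sizes, and the plain CNN-plus-MLP design are placeholders chosen only to show the interface (two frames in, one latent action out).

```python
# Hypothetical sketch of a latent action encoder; not the authors' model.
import torch
import torch.nn as nn


class LatentActionEncoder(nn.Module):
    """Encodes the visual change between two frames into a latent action vector."""

    def __init__(self, latent_dim: int = 16):
        super().__init__()
        # Shared convolutional backbone applied to each frame independently.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP head that compares the two frame embeddings and emits a latent action.
        self.head = nn.Sequential(
            nn.Linear(2 * 64, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        z_t = self.backbone(frame_t)
        z_t1 = self.backbone(frame_t1)
        # The latent action summarizes what changed between frame_t and frame_t1.
        return self.head(torch.cat([z_t, z_t1], dim=-1))


if __name__ == "__main__":
    enc = LatentActionEncoder()
    f0 = torch.randn(1, 3, 64, 64)   # frame at time t
    f1 = torch.randn(1, 3, 64, 64)   # frame at time t+1
    print(enc(f0, f1).shape)         # torch.Size([1, 16])
```

In a ViLLA-style pipeline, latent actions of this kind could serve as abstract action labels during VLA pre-training; how they are learned and incorporated is exactly what the paper's contributions address.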