villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
July 31, 2025
Authors: Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian
cs.AI
Abstract
Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of the visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments, including SIMPLER and LIBERO, as well as on two real-world robot setups covering gripper and dexterous-hand manipulation. We believe the ViLLA paradigm holds significant promise, and that villa-X provides a strong foundation for future research.
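
To make the latent-action idea concrete, below is a minimal sketch of an inverse-dynamics-style encoder that maps a pair of consecutive frames to a compact latent action vector, i.e., an abstract representation of the visual change between the two frames. This is an illustrative assumption, not the architecture used in villa-X: the module names, layer sizes, and the plain CNN-plus-MLP design are placeholders chosen only to show the interface (two frames in, one latent action out).

```python
# Hypothetical sketch of a latent action encoder; not the authors' model.
import torch
import torch.nn as nn


class LatentActionEncoder(nn.Module):
    """Encodes the visual change between two frames into a latent action vector."""

    def __init__(self, latent_dim: int = 16):
        super().__init__()
        # Shared convolutional backbone applied to each frame independently.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP head that compares the two frame embeddings and emits a latent action.
        self.head = nn.Sequential(
            nn.Linear(2 * 64, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        z_t = self.backbone(frame_t)
        z_t1 = self.backbone(frame_t1)
        # The latent action summarizes what changed between frame_t and frame_t1.
        return self.head(torch.cat([z_t, z_t1], dim=-1))


if __name__ == "__main__":
    enc = LatentActionEncoder()
    f0 = torch.randn(1, 3, 64, 64)   # frame at time t
    f1 = torch.randn(1, 3, 64, 64)   # frame at time t+1
    print(enc(f0, f1).shape)         # torch.Size([1, 16])
```

In a ViLLA-style pipeline, latent actions of this kind could serve as abstract action labels during VLA pre-training; how they are learned and incorporated is exactly what the paper's contributions address.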