X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
October 11, 2025
Authors: Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, Xianyuan Zhan
cs.AI
Abstract
Successful generalist Vision-Language-Action (VLA) models rely on effective
training across diverse robotic platforms with large-scale, cross-embodiment,
heterogeneous datasets. To facilitate and leverage the heterogeneity in rich,
diverse robotic data sources, we propose a novel Soft Prompt approach with
minimally added parameters, by infusing prompt learning concepts into
cross-embodiment robot learning and introducing separate sets of learnable
embeddings for each distinct data source. These embeddings serve as
embodiment-specific prompts, which together enable VLA models to effectively
exploit varying cross-embodiment features. Our new X-VLA, a neat
flow-matching-based VLA architecture, relies exclusively on soft-prompted
standard Transformer encoders, enjoying both scalability and simplicity.
Evaluated across 6 simulation environments and 3 real-world robots, our 0.9B
instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep
of benchmarks, demonstrating superior results across a wide range of capabilities,
from flexible dexterity to quick adaptation across embodiments, environments,
and tasks. Website: https://thu-air-dream.github.io/X-VLA/
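
To make the soft-prompt idea in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class name `SoftPromptBank`, the source names `source_a` and `source_b`, and the parameters `prompt_len` and `dim` are all hypothetical. It shows only the bookkeeping, where each data source owns a separate table of prompt vectors that is prepended to the input token sequence. In a real model these vectors would be trainable parameters optimized jointly with the Transformer backbone, which is omitted here.

```python
import random


class SoftPromptBank:
    """Holds one table of prompt vectors per data source (hypothetical helper)."""

    def __init__(self, source_names, prompt_len, dim, seed=0):
        rng = random.Random(seed)
        # Each source gets its own independently initialised prompt_len x dim table.
        # In a trained model these values would be learned, not fixed.
        self.prompts = {
            name: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                   for _ in range(prompt_len)]
            for name in source_names
        }

    def prepend(self, source_name, tokens):
        """Return the source's prompt vectors followed by the input token vectors."""
        return self.prompts[source_name] + tokens


# Two illustrative data sources, each with a 4-vector prompt of width 8.
bank = SoftPromptBank(["source_a", "source_b"], prompt_len=4, dim=8)

# Stand-in for the fused vision, language, and action tokens of one sample.
tokens = [[0.0] * 8 for _ in range(10)]

# The sequence a Transformer encoder would consume for a sample from source_a.
sequence = bank.prepend("source_a", tokens)
```

Because only the small per-source tables differ between data sources, adding a new source in this sketch costs `prompt_len * dim` extra values while everything downstream stays shared, which mirrors the abstract's claim of minimally added parameters.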