X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

October 11, 2025
作者: Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, Xianyuan Zhan
cs.AI

Abstract

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach that adds minimal parameters: we infuse prompt-learning concepts into cross-embodiment robot learning and introduce a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulation environments and on 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance across a sweep of benchmarks, demonstrating superior results on a wide range of capability axes, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
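
To make the mechanism concrete, below is a minimal PyTorch sketch of the soft-prompt idea the abstract describes: one small set of learnable embeddings per data source, prepended to the fused input tokens of a standard Transformer encoder, together with a standard rectified-flow-style flow-matching objective for action prediction. All names (SoftPromptedEncoder, velocity_head, flow_matching_loss), sizes, and the exact flow-matching variant are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPromptedEncoder(nn.Module):
    """Sketch of the soft-prompt mechanism: each embodiment / data source
    owns a small table of learnable prompt embeddings, prepended to the
    fused vision-language-state tokens before a plain Transformer encoder.
    Module name and all dimensions here are illustrative."""

    def __init__(self, num_embodiments: int, num_prompt_tokens: int = 16,
                 d_model: int = 768, num_layers: int = 12, num_heads: int = 12):
        super().__init__()
        # One learnable prompt set per data source; this is the only
        # embodiment-specific state, so added parameters stay minimal.
        self.prompts = nn.Parameter(
            torch.randn(num_embodiments, num_prompt_tokens, d_model) * 0.02
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens: torch.Tensor, embodiment_id: torch.Tensor):
        # tokens: (B, T, d_model) fused input tokens.
        # embodiment_id: (B,) integer index of each sample's data source.
        prompt = self.prompts[embodiment_id]    # (B, P, d_model)
        x = torch.cat([prompt, tokens], dim=1)  # prepend embodiment prompts
        return self.encoder(x)

def flow_matching_loss(velocity_head, context, actions):
    """Generic rectified-flow-style flow-matching objective (assumed, not
    the paper's exact formulation): regress the constant velocity that
    transports Gaussian noise to the target action chunk along a linear path.
    `velocity_head` is a hypothetical network (noisy_actions, t, context) -> velocity."""
    noise = torch.randn_like(actions)                          # (B, H, A)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    noisy = (1.0 - t) * noise + t * actions                    # point on the path
    target = actions - noise                                   # velocity along the path
    return F.mse_loss(velocity_head(noisy, t, context), target)
```

Because only the per-source prompt tables are embodiment-specific, adapting the model to a new embodiment can, in principle, reuse the shared backbone and learn just one new prompt set, which matches the abstract's emphasis on minimal added parameters and quick cross-embodiment adaptation.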