10 Open Challenges Steering the Future of Vision-Language-Action Models
November 8, 2025
Authors: Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu
cs.AI
Abstract
Due to their ability to follow natural language instructions,
vision-language-action (VLA) models are increasingly prevalent in the embodied
AI arena, following the widespread success of their precursors -- LLMs and
VLMs. In this paper, we discuss 10 principal milestones in the ongoing
development of VLA models -- multimodality, reasoning, data, evaluation,
cross-robot action generalization, efficiency, whole-body coordination, safety,
agents, and coordination with humans. Furthermore, we discuss the emerging
trends of spatial understanding, world-dynamics modeling, post-training, and
data synthesis -- all aimed at reaching these milestones. Through these
discussions, we hope to draw attention to research avenues that may
accelerate the progress of VLA models toward broader applicability.