

10 Open Challenges Steering the Future of Vision-Language-Action Models

November 8, 2025
作者: Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu
cs.AI

Abstract

Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of spatial understanding, world-dynamics modeling, post-training, and data synthesis, all of which aim toward these milestones. Through these discussions, we hope to draw attention to the research avenues that may accelerate the progress of VLA models toward wider acceptance.