10 Open Challenges Steering the Future of Vision-Language-Action Models

Abstract

Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in embodied AI, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss four emerging trends that aim to reach these milestones: using spatial understanding, modeling world dynamics, post-training, and data synthesis. Through these discussions, we hope to draw attention to the research avenues that may accelerate the development of VLA models toward wider acceptance.
