10 Open Challenges Steering the Future of Vision-Language-Action Models

Abstract

Owing to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in embodied AI, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss emerging trends, namely spatial understanding, modeling world dynamics, post-training, and data synthesis, all of which aim to reach these milestones. Through these discussions, we hope to draw attention to the research avenues that may accelerate the development of VLA models toward wider acceptance.
