A Survey on Efficient Vision-Language-Action Models
October 27, 2025
Authors: Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen
cs.AI
Abstract
Vision-Language-Action models (VLAs) represent a significant frontier in
embodied intelligence, aiming to bridge digital knowledge with physical-world
interaction. While these models have demonstrated remarkable generalist
capabilities, their deployment is severely hampered by the substantial
computational and data requirements inherent to their underlying large-scale
foundation models. Motivated by the urgent need to address these challenges,
this survey presents the first comprehensive review of Efficient
Vision-Language-Action models (Efficient VLAs) across the entire
data-model-training process. Specifically, we introduce a unified taxonomy to
systematically organize the disparate efforts in this domain, categorizing
current techniques into three core pillars: (1) Efficient Model Design,
focusing on efficient architectures and model compression; (2) Efficient
Training, which reduces computational burdens during model learning; and (3)
Efficient Data Collection, which addresses the bottlenecks in acquiring and
utilizing robotic data. Through a critical review of state-of-the-art methods
within this framework, this survey not only establishes a foundational
reference for the community but also summarizes representative applications,
delineates key challenges, and charts a roadmap for future research. We
maintain a continuously updated project page to track our latest developments:
https://evla-survey.github.io/
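
For a quick structural view, the three-pillar taxonomy summarized in the abstract can be sketched as a small data structure. The sketch below is illustrative only: the pillar names come from the abstract, while the subtopic strings are paraphrases rather than the survey's official section headings, and the variable name EFFICIENT_VLA_TAXONOMY is a placeholder.

# A minimal, hypothetical sketch (not from the survey or its codebase):
# the three-pillar taxonomy described in the abstract, expressed as a
# small Python mapping so the structure of the review is visible at a glance.
EFFICIENT_VLA_TAXONOMY = {
    "Efficient Model Design": [
        "efficient architectures",
        "model compression",
    ],
    "Efficient Training": [
        "reducing computational burden during model learning",
    ],
    "Efficient Data Collection": [
        "acquiring robotic data",
        "utilizing robotic data",
    ],
}

# Print a one-line summary per pillar.
for pillar, topics in EFFICIENT_VLA_TAXONOMY.items():
    print(f"{pillar}: {', '.join(topics)}")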