EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
June 11, 2025
Authors: Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang
cs.AI
Abstract
Vision-Language-Action (VLA) models, particularly diffusion-based
architectures, demonstrate transformative potential for embodied intelligence
but are severely hampered by high computational and memory demands stemming
from extensive inherent and inference-time redundancies. While existing
acceleration efforts often target isolated inefficiencies, such piecemeal
solutions typically fail to holistically address the varied computational and
memory bottlenecks across the entire VLA pipeline, thereby limiting practical
deployability. We introduce EfficientVLA, a structured and training-free
inference acceleration framework that systematically eliminates these barriers
by cohesively exploiting multifaceted redundancies. EfficientVLA
synergistically integrates three targeted strategies: (1) pruning of
functionally inconsequential layers from the language module, guided by an
analysis of inter-layer redundancies; (2) optimizing the visual processing
pathway through a task-aware strategy that selects a compact, diverse set of
visual tokens, balancing task-criticality with informational coverage; and (3)
alleviating temporal computational redundancy within the iterative
diffusion-based action head by strategically caching and reusing key
intermediate features. We apply our method to CogACT, a standard VLA
model, yielding a 1.93x inference speedup and reducing FLOPs to 28.9%,
with only a 0.6% drop in success rate on the SIMPLER benchmark.
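The abstract does not specify how inter-layer redundancy is measured for strategy (1). One common training-free proxy, shown in the hypothetical PyTorch sketch below, is the cosine similarity between each decoder layer's input and output hidden states: layers that barely transform their input are pruning candidates. The function names and the use of per-layer hidden states from a calibration pass are assumptions, not the paper's stated procedure.

import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_redundancy_scores(hidden_states):
    # hidden_states: per-layer activations from a calibration pass,
    # e.g. model(..., output_hidden_states=True); each is (batch, seq, dim).
    # A layer whose output is nearly parallel to its input did little work.
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        sim = F.cosine_similarity(h_in.flatten(0, 1), h_out.flatten(0, 1), dim=-1)
        scores.append(sim.mean().item())
    return scores  # one score per layer; near 1.0 => redundant

def prune_most_redundant(layers, scores, n_drop):
    # Drop the n_drop layers with the highest input/output similarity.
    drop = set(sorted(range(len(scores)), key=lambda i: -scores[i])[:n_drop])
    return torch.nn.ModuleList(l for i, l in enumerate(layers) if i not in drop)

# Toy check with random activations standing in for calibration data.
hs = [torch.randn(2, 16, 64) for _ in range(7)]  # boundaries of 6 "layers"
print(layer_redundancy_scores(hs))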
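For strategy (2), one plausible reading of "balancing task-criticality with informational coverage" is a greedy maximal-marginal-relevance selection over visual tokens. The sketch below assumes per-token task-relevance scores (for example, cross-attention mass from the instruction tokens) are already available; select_visual_tokens and the trade-off weight lam are illustrative names, not the paper's API.

import torch
import torch.nn.functional as F

@torch.no_grad()
def select_visual_tokens(vis_tokens, task_scores, k, lam=0.5):
    # vis_tokens: (N, d) visual token features.
    # task_scores: (N,) task relevance per token (assumed precomputed).
    # Greedily trade off relevance against similarity to tokens already
    # kept, so the kept set is both task-critical and diverse.
    feats = F.normalize(vis_tokens, dim=-1)
    selected = [int(task_scores.argmax())]
    for _ in range(k - 1):
        sim_to_sel = (feats @ feats[selected].T).max(dim=-1).values  # (N,)
        mmr = lam * task_scores - (1 - lam) * sim_to_sel
        mmr[selected] = float("-inf")  # never re-pick a kept token
        selected.append(int(mmr.argmax()))
    return vis_tokens[selected], sorted(selected)

# Toy usage: keep 32 of 196 patch tokens.
toks, scores = torch.randn(196, 768), torch.rand(196)
kept, idx = select_visual_tokens(toks, scores, k=32)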
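Strategy (3) caches and reuses intermediate features across the iterative denoising steps of the action head. The sketch below assumes a simple static schedule that recomputes one block's residual every N steps and reuses it on the steps in between; the paper's actual caching criterion, and which features it caches, may differ.

import torch

class CachedBlock(torch.nn.Module):
    # Wraps one transformer block of the diffusion action head. On
    # "refresh" steps the block runs normally and its residual update is
    # cached; on other steps the cached residual is reused, skipping the
    # block's compute entirely. A static every-N schedule is assumed.
    def __init__(self, block, refresh_every=3):
        super().__init__()
        self.block = block
        self.refresh_every = refresh_every
        self.cache = None

    def forward(self, x, step):
        if step % self.refresh_every == 0 or self.cache is None:
            out = self.block(x)
            self.cache = out - x   # cache the residual update
            return out
        return x + self.cache      # reuse the cached residual

# Toy usage: an MLP standing in for one block, 10 denoising steps.
blk = CachedBlock(torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)))
x = torch.randn(4, 64)
for t in range(10):
    x = blk(x, step=t)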