EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
June 11, 2025
Authors: Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang
cs.AI
Abstract
Vision-Language-Action (VLA) models, particularly diffusion-based
architectures, demonstrate transformative potential for embodied intelligence
but are severely hampered by high computational and memory demands stemming
from extensive inherent and inference-time redundancies. While existing
acceleration efforts often target isolated inefficiencies, such piecemeal
solutions typically fail to holistically address the varied computational and
memory bottlenecks across the entire VLA pipeline, thereby limiting practical
deployability. We introduce EfficientVLA, a structured and training-free
inference acceleration framework that systematically eliminates these barriers
by cohesively exploiting multifaceted redundancies. EfficientVLA
synergistically integrates three targeted strategies: (1) pruning of
functionally inconsequential layers from the language module, guided by an
analysis of inter-layer redundancies; (2) optimizing the visual processing
pathway through a task-aware strategy that selects a compact, diverse set of
visual tokens, balancing task-criticality with informational coverage; and (3)
alleviating temporal computational redundancy within the iterative
diffusion-based action head by strategically caching and reusing key
intermediate features. We apply our method to CogACT, a standard VLA
model, yielding a 1.93x inference speedup and reducing FLOPs to 28.9%,
with only a 0.6% drop in success rate on the SIMPLER benchmark.
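The abstract does not specify how inter-layer redundancy is measured for strategy (1). One common training-free proxy, shown in the hypothetical PyTorch sketch below, is the cosine similarity between each decoder layer's input and output hidden states: layers that barely transform their input are pruning candidates. The function names and the use of per-layer hidden states from a calibration pass are assumptions, not the paper's stated procedure.

import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_redundancy_scores(hidden_states):
    # hidden_states: per-layer activations from a calibration pass,
    # e.g. model(..., output_hidden_states=True); each is (batch, seq, dim).
    # A layer whose output is nearly parallel to its input did little work.
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        sim = F.cosine_similarity(h_in.flatten(0, 1), h_out.flatten(0, 1), dim=-1)
        scores.append(sim.mean().item())
    return scores  # one score per layer; near 1.0 => redundant

def prune_most_redundant(layers, scores, n_drop):
    # Drop the n_drop layers with the highest input/output similarity.
    drop = set(sorted(range(len(scores)), key=lambda i: -scores[i])[:n_drop])
    return torch.nn.ModuleList(l for i, l in enumerate(layers) if i not in drop)

# Toy check with random activations standing in for calibration data.
hs = [torch.randn(2, 16, 64) for _ in range(7)]  # boundaries of 6 "layers"
print(layer_redundancy_scores(hs))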
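For strategy (2), one plausible reading of "balancing task-criticality with informational coverage" is a greedy maximal-marginal-relevance selection over visual tokens. The sketch below assumes per-token task-relevance scores (for example, cross-attention mass from the instruction tokens) are already available; select_visual_tokens and the trade-off weight lam are illustrative names, not the paper's API.

import torch
import torch.nn.functional as F

@torch.no_grad()
def select_visual_tokens(vis_tokens, task_scores, k, lam=0.5):
    # vis_tokens: (N, d) visual token features.
    # task_scores: (N,) task relevance per token (assumed precomputed).
    # Greedily trade off relevance against similarity to tokens already
    # kept, so the kept set is both task-critical and diverse.
    feats = F.normalize(vis_tokens, dim=-1)
    selected = [int(task_scores.argmax())]
    for _ in range(k - 1):
        sim_to_sel = (feats @ feats[selected].T).max(dim=-1).values  # (N,)
        mmr = lam * task_scores - (1 - lam) * sim_to_sel
        mmr[selected] = float("-inf")  # never re-pick a kept token
        selected.append(int(mmr.argmax()))
    return vis_tokens[selected], sorted(selected)

# Toy usage: keep 32 of 196 patch tokens.
toks, scores = torch.randn(196, 768), torch.rand(196)
kept, idx = select_visual_tokens(toks, scores, k=32)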
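Strategy (3) caches and reuses intermediate features across the iterative denoising steps of the action head. The sketch below assumes a simple static schedule that recomputes one block's residual every N steps and reuses it on the steps in between; the paper's actual caching criterion, and which features it caches, may differ.

import torch

class CachedBlock(torch.nn.Module):
    # Wraps one transformer block of the diffusion action head. On
    # "refresh" steps the block runs normally and its residual update is
    # cached; on other steps the cached residual is reused, skipping the
    # block's compute entirely. A static every-N schedule is assumed.
    def __init__(self, block, refresh_every=3):
        super().__init__()
        self.block = block
        self.refresh_every = refresh_every
        self.cache = None

    def forward(self, x, step):
        if step % self.refresh_every == 0 or self.cache is None:
            out = self.block(x)
            self.cache = out - x   # cache the residual update
            return out
        return x + self.cache      # reuse the cached residual

# Toy usage: an MLP standing in for one block, 10 denoising steps.
blk = CachedBlock(torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)))
x = torch.randn(4, 64)
for t in range(10):
    x = blk(x, step=t)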