EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
June 11, 2025
Authors: Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang
cs.AI
Abstract
Vision-Language-Action (VLA) models, particularly diffusion-based
architectures, demonstrate transformative potential for embodied intelligence
but are severely hampered by high computational and memory demands stemming
from extensive inherent and inference-time redundancies. While existing
acceleration efforts often target isolated inefficiencies, such piecemeal
solutions typically fail to holistically address the varied computational and
memory bottlenecks across the entire VLA pipeline, thereby limiting practical
deployability. We introduce EfficientVLA, a structured and training-free
inference acceleration framework that systematically eliminates these barriers
by cohesively exploiting multifaceted redundancies. EfficientVLA
synergistically integrates three targeted strategies: (1) pruning of
functionally inconsequential layers from the language module, guided by an
analysis of inter-layer redundancies; (2) optimizing the visual processing
pathway through a task-aware strategy that selects a compact, diverse set of
visual tokens, balancing task-criticality with informational coverage; and (3)
alleviating temporal computational redundancy within the iterative
diffusion-based action head by strategically caching and reusing key
intermediate features. We apply our method to a standard VLA model, CogACT,
yielding a 1.93× inference speedup and reducing FLOPs to 28.9% of the original,
with only a 0.6% drop in success rate on the SIMPLER benchmark.
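
To make strategy (1) concrete, the sketch below ranks layers by how little they change their input and marks the most redundant ones for removal. The abstract does not state which redundancy metric is used, so the cosine-similarity score, the function names (`layer_redundancy`, `layers_to_prune`), and the toy hidden states are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def layer_redundancy(hidden_states):
    """Score each layer by the similarity of its input and output.

    hidden_states[i] is the vector entering layer i and
    hidden_states[i + 1] is the vector leaving it. A score near 1
    means the layer barely changes the direction of its input.
    (Assumed metric; the abstract only says "inter-layer redundancies".)
    """
    return [
        cosine(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]


def layers_to_prune(hidden_states, k):
    """Return the indices of the k most redundant layers, in ascending order."""
    scores = layer_redundancy(hidden_states)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy hidden states for a hypothetical 3-layer language module.
states = [
    [1.0, 0.0],  # input to layer 0
    [0.0, 1.0],  # layer 0 rotates the state (low similarity)
    [0.0, 2.0],  # layer 1 only rescales it (similarity 1.0)
    [1.0, 1.0],  # layer 2 changes it moderately
]
pruned = layers_to_prune(states, 1)
```

In this toy case layer 1 only rescales its input, so it receives the highest redundancy score and is the one selected. In practice the scores would be averaged over many tokens and calibration samples before any layer is dropped.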