EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA integrates three targeted strategies: (1) pruning functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to CogACT, a standard VLA model, yielding a 1.93× inference speedup and reducing FLOPs to 28.9%, with only a 0.6% drop in success rate on the SIMPLER benchmark.
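As an illustration of strategy (1), the sketch below ranks layers by how much they change their input and marks the least influential ones for pruning. The abstract does not state which redundancy metric is used, so the cosine-similarity score, the function names, and the toy hidden states here are all assumptions made for this example, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_importance(hidden_in, hidden_out):
    """Assumed metric: 1 - mean per-token cosine similarity between a layer's
    input and output states. A score near 0 means the layer barely changes
    the direction of its input, i.e. it looks redundant."""
    sims = [cosine_similarity(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)


def select_layers_to_prune(hidden_states, num_to_prune):
    """hidden_states[i] holds the token vectors entering layer i, and
    hidden_states[i + 1] the vectors leaving it. Returns the indices of the
    num_to_prune lowest-scoring layers, plus all scores."""
    scores = [
        layer_importance(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_to_prune]), scores


# Toy example (hypothetical data): 3 layers, 2 tokens, 2-dimensional states.
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 0 rotates every token by 90 degrees
    [[0.0, 2.0], [2.0, 0.0]],  # layer 1 only rescales its input
    [[1.0, 1.0], [1.0, 1.0]],  # layer 2 rotates every token by 45 degrees
]
pruned, scores = select_layers_to_prune(toy_states, num_to_prune=1)
print("layers to prune:", pruned)
print("importance scores:", [round(s, 4) for s in scores])
```

In this toy setup, layer 1 only rescales its input, so it receives the lowest score and is the one selected. In a real model the hidden states would be collected from a calibration set, and the number of layers to remove would be chosen against a task metric such as success rate.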
