EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. Existing acceleration efforts often target isolated inefficiencies, and such piecemeal solutions typically fail to address the varied computational and memory bottlenecks across the entire VLA pipeline, which limits practical deployability. We introduce EfficientVLA, a structured, training-free inference acceleration framework that removes these barriers by jointly exploiting multiple sources of redundancy. EfficientVLA integrates three targeted strategies: (1) pruning functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to a standard VLA model, CogACT, yielding a 1.93× inference speedup and reducing FLOPs to 28.9%, with only a 0.6% drop in success rate on the SIMPLER benchmark.
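
The abstract names the three strategies without giving their exact criteria. The toy sketch below shows one plausible shape for each, purely as an illustration. The cosine-similarity redundancy score, the greedy diversity rule, the fixed cache refresh interval, and every function and parameter name are assumptions made for this example, not the procedure used in EfficientVLA.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_layers_by_redundancy(layer_inputs, layer_outputs):
    """Return layer indices ordered from most to least redundant.

    Assumption: a layer whose output is nearly parallel to its input
    changes the hidden state little, so it is a pruning candidate.
    """
    change = [1.0 - cosine(x, y) for x, y in zip(layer_inputs, layer_outputs)]
    return sorted(range(len(change)), key=lambda i: change[i])


def select_tokens(features, importance, n_key, n_diverse):
    """Keep the n_key most important tokens, then greedily add up to
    n_diverse tokens that are least similar to the ones already kept.

    Assumption: importance is a per-token task-relevance score supplied
    by the caller (hypothetical; how it is computed is not modelled here).
    """
    order = sorted(range(len(features)), key=lambda i: -importance[i])
    kept, rest = order[:n_key], order[n_key:]
    for _ in range(min(n_diverse, len(rest))):
        # Pick the candidate whose closest kept token is farthest away.
        best = min(
            rest,
            key=lambda i: max(
                (cosine(features[i], features[j]) for j in kept), default=-1.0
            ),
        )
        kept.append(best)
        rest.remove(best)
    return sorted(kept)


def denoise_with_cache(expensive_block, n_steps, refresh_every):
    """Run n_steps iterations, calling expensive_block only every
    refresh_every steps and reusing the cached result in between."""
    cached, outputs = None, []
    for t in range(n_steps):
        if t % refresh_every == 0:
            cached = expensive_block(t)  # recompute and refresh the cache
        outputs.append(cached)           # otherwise reuse the cached feature
    return outputs
```

In this sketch, `rank_layers_by_redundancy` would be fed hidden states recorded before and after each language-model layer, `select_tokens` would be given visual token features with their relevance scores, and `denoise_with_cache` would wrap the costly part of a denoising step. Which layers, tokens, and intermediate features the actual framework targets is not stated in the abstract.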
