SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
November 30, 2025
Authors: Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei
cs.AI
Abstract
Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality by their large parameter counts. To mitigate this issue, prior work has explored lightweight VLMs, but these compromise spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. We therefore propose SwiftVLA, an architecture that equips a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. To enhance the VLM's ability to exploit both 2D images and 4D features, we then introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks the 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, matching their performance on edge devices while running 18 times faster and using 12 times less memory.
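The abstract mentions a temporal cache attached to the pretrained 4D visual geometry transformer but does not specify the caching mechanism. As a rough, hypothetical PyTorch sketch, a per-frame cache that encodes only the newest observation and reuses past features might look like this (the class name `TemporalCache4D`, the window size, and the frozen-encoder assumption are all illustrative, not the paper's implementation):

```python
import collections

import torch
import torch.nn as nn


class TemporalCache4D(nn.Module):
    """Hypothetical per-frame feature cache for a pretrained geometry encoder."""

    def __init__(self, frame_encoder: nn.Module, window: int = 8):
        super().__init__()
        self.frame_encoder = frame_encoder             # any per-frame 2D -> feature encoder
        self.cache = collections.deque(maxlen=window)  # keeps the last `window` frame features

    @torch.no_grad()  # assumes the pretrained geometry branch is kept frozen
    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, C, H, W) newest 2D observation
        self.cache.append(self.frame_encoder(frame))
        # Stack cached per-frame features into a (B, T, ...) temporal feature tensor;
        # only the newest frame is encoded at each step, past features are reused.
        return torch.stack(list(self.cache), dim=1)
```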
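The Fusion Tokens are described as learnable tokens that build a unified representation from 2D image tokens and 4D features, trained with a future prediction objective. A minimal sketch of one plausible realization follows; the cross-attention design, dimensions, and MSE future-prediction loss are assumptions rather than the published architecture:

```python
import torch
import torch.nn as nn


class FusionTokens(nn.Module):
    """Hypothetical fusion module: learnable queries over 2D + 4D features."""

    def __init__(self, num_tokens: int = 16, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.future_head = nn.Linear(dim, dim)  # predicts features of a future frame

    def forward(self, img_feats: torch.Tensor, feats_4d: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N_img, dim) 2D image tokens; feats_4d: (B, N_4d, dim) 4D features
        context = torch.cat([img_feats, feats_4d], dim=1)
        queries = self.tokens.unsqueeze(0).expand(context.size(0), -1, -1)
        fused, _ = self.cross_attn(queries, context, context)
        return fused  # unified representation consumed downstream for action generation

    def future_prediction_loss(self, fused: torch.Tensor,
                               future_target: torch.Tensor) -> torch.Tensor:
        # Auxiliary objective: regress pooled fusion tokens onto future-frame features.
        return nn.functional.mse_loss(self.future_head(fused.mean(dim=1)), future_target)
```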
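Finally, the mask-and-reconstruct strategy masks the 4D inputs fed to the VLM and trains the VLA to reconstruct them. One way this could look as a training loss is sketched below; the mask ratio, zero-fill masking, and MSE reconstruction target are guesses at the general shape, not the published method:

```python
import torch
import torch.nn as nn


def mask_and_reconstruct_loss(tokens_4d: torch.Tensor,
                              vlm: nn.Module,
                              mask_ratio: float = 0.5) -> torch.Tensor:
    # tokens_4d: (B, N, D) 4D feature tokens before they enter the VLM.
    B, N, _ = tokens_4d.shape
    mask = torch.rand(B, N, device=tokens_4d.device) < mask_ratio
    masked = tokens_4d.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = vlm(masked)  # stand-in for the VLM forward pass over 4D tokens
    # Supervise reconstruction only at masked positions; once the VLM can infer
    # the masked 4D content, the 4D branch can be dropped at inference time.
    return nn.functional.mse_loss(recon[mask], tokens_4d[mask])
```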