SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

November 30, 2025
Authors: Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei
cs.AI

Abstract

Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality by their large parameter counts. To mitigate this, prior work has explored lightweight VLMs, but at the cost of spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they typically rely on large VLMs to fuse the 3D and 2D inputs and still lack temporal understanding. We therefore propose SwiftVLA, an architecture that endows a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. To strengthen the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future-prediction objective to produce unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks the 4D inputs to the VLM and trains the VLA to reconstruct them, encouraging the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, matching their performance on edge devices while being 18 times faster and using 12 times less memory.
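
The abstract names three mechanisms: 4D features extracted from 2D images, learnable Fusion Tokens trained with a future-prediction objective, and a mask-and-reconstruct objective that lets the 4D branch be dropped at inference. Below is a minimal PyTorch sketch of how the Fusion Tokens and the mask-and-reconstruct objective could fit together. All module names, dimensions, loss weights, and the Transformer stand-in for the compact VLM are illustrative assumptions, not the authors' implementation; the paper's 4D visual geometry transformer and temporal cache are abstracted away as precomputed `feat4d_tokens`.

```python
# Minimal sketch of the training-time objectives described in the abstract.
# Everything here is an assumption for illustration, not the SwiftVLA code.
import torch
import torch.nn as nn


class SwiftVLASketch(nn.Module):
    def __init__(self, dim=512, num_fusion_tokens=8, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable Fusion Tokens that aggregate 2D and 4D context.
        self.fusion_tokens = nn.Parameter(torch.randn(num_fusion_tokens, dim))
        # Learned placeholder inserted where 4D features are masked out.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Stand-in for the compact VLM backbone (a small Transformer encoder).
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        # Heads: future prediction on the fused tokens, reconstruction of the
        # masked 4D features, and action generation from the fused summary.
        self.future_head = nn.Linear(dim, dim)
        self.recon_head = nn.Linear(dim, dim)
        self.action_head = nn.Linear(dim, 7)  # e.g. a 7-DoF end-effector action

    def forward(self, img_tokens, feat4d_tokens, mask_ratio=0.5):
        B, N4, D = feat4d_tokens.shape
        # Mask-and-reconstruct: randomly hide a fraction of the 4D tokens so
        # the backbone must internalize 4D structure from the 2D tokens alone,
        # which is what allows dropping the 4D branch at inference.
        keep = torch.rand(B, N4, 1, device=feat4d_tokens.device) > mask_ratio
        masked_4d = torch.where(keep, feat4d_tokens,
                                self.mask_token.expand(B, N4, D))
        fusion = self.fusion_tokens.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([fusion, img_tokens, masked_4d], dim=1)
        x = self.backbone(x)
        nf = fusion.shape[1]
        fused = x[:, :nf]                    # Fusion Token outputs
        future = self.future_head(fused)     # future-prediction targets
        recon = self.recon_head(x[:, -N4:])  # reconstruct masked 4D features
        actions = self.action_head(fused.mean(dim=1))
        return actions, future, recon, keep


# Illustrative training step for the reconstruction term; the action and
# future-prediction losses would use demonstration actions and features of
# future frames as targets, which are omitted here.
model = SwiftVLASketch()
img = torch.randn(2, 64, 512)     # 2D image tokens from the vision encoder
feat4d = torch.randn(2, 32, 512)  # 4D features from the geometry transformer
actions, future, recon, keep = model(img, feat4d)
masked = ~keep.squeeze(-1)        # positions that were hidden from the model
recon_loss = nn.functional.mse_loss(recon[masked], feat4d[masked])
```

At inference, under this reading, the `feat4d_tokens` slot can simply be filled entirely with the mask token (or omitted), so the 4D geometry transformer never runs, which is consistent with the reported 18x speedup and 12x memory reduction on edge devices.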