SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Abstract

Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential, but their large parameter counts limit their practicality. Lightweight VLMs have been explored as a remedy, but they compromise spatiotemporal reasoning. Some methods suggest that additional 3D inputs can help, yet they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. We therefore propose SwiftVLA, an architecture that gives a compact model 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. To enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to produce unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks the 4D inputs to the VLM and trains the VLA to reconstruct them, which enables the VLM to learn effective 4D representations and allows the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while running 18 times faster and with a 12 times smaller memory footprint.
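The mask-and-reconstruct idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the per-token random masking, the `mask_ratio` value, the constant `mask_token`, the mean-squared-error loss, and the function names `mask_tokens` and `reconstruction_loss` are all assumptions chosen for illustration, and the 4D features are replaced by small hand-made vectors.

```python
import random


def mask_tokens(tokens, mask_ratio, mask_token, rng):
    """Replace a random subset of token vectors with mask_token.

    Returns (masked_tokens, is_masked); is_masked[i] is True where the
    original token was hidden. Per-token masking is an assumption here.
    """
    n = len(tokens)
    n_masked = round(n * mask_ratio)
    hidden = set(rng.sample(range(n), n_masked))
    masked = [list(mask_token) if i in hidden else list(tok)
              for i, tok in enumerate(tokens)]
    is_masked = [i in hidden for i in range(n)]
    return masked, is_masked


def reconstruction_loss(predicted, target, is_masked):
    """Mean squared error computed over the masked positions only (assumed loss)."""
    total, count = 0.0, 0
    for pred, tgt, hidden in zip(predicted, target, is_masked):
        if hidden:
            total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
            count += len(tgt)
    return total / count if count else 0.0


rng = random.Random(0)
# Toy stand-in for 4D feature tokens: 8 tokens of dimension 2.
features_4d = [[float(i), float(i) + 0.5] for i in range(8)]
masked, is_masked = mask_tokens(features_4d, mask_ratio=0.5,
                                mask_token=[0.0, 0.0], rng=rng)

# A perfect reconstruction yields zero loss; echoing the masked input does not.
loss_perfect = reconstruction_loss(features_4d, features_4d, is_masked)
loss_naive = reconstruction_loss(masked, features_4d, is_masked)
```

In a real training loop the prediction would come from the model rather than from the input itself. The point of the sketch is only that the loss is taken on positions the model could not see, which is what pushes it to infer the hidden features from the remaining inputs.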
