From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
October 20, 2025
Authors: Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou
cs.AI
Abstract
Existing vision-language-action (VLA) models act in the 3D real world but are
typically built on 2D encoders, leaving a spatial reasoning gap that limits
generalization and adaptability. Recent 3D integration techniques for VLAs
either require specialized sensors and transfer poorly across modalities, or
inject weak cues that lack geometry and degrade vision-language alignment. In
this work, we introduce FALCON (From Spatial to Action), a novel paradigm that
injects rich 3D spatial tokens into the action head. FALCON leverages spatial
foundation models to deliver strong geometric priors from RGB alone, and
includes an Embodied Spatial Model that can optionally fuse depth or pose for
higher fidelity when available, without retraining or architectural changes. To
preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced
Action Head rather than being concatenated into the vision-language backbone.
These designs enable FALCON to address limitations in spatial representation,
modality transferability, and alignment. In comprehensive evaluations across
three simulation benchmarks and eleven real-world tasks, FALCON
achieves state-of-the-art performance, consistently surpasses competitive
baselines, and remains robust under clutter, spatial-prompt conditioning, and
variations in object scale and height.
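To make the architectural distinction concrete, here is a minimal, hypothetical PyTorch sketch of the design the abstract describes: 3D spatial tokens from a frozen spatial foundation model are consumed inside the action head, rather than being concatenated into the vision-language backbone's token sequence. The cross-attention fusion, module names, and dimensions below are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a spatial-enhanced action head (not the paper's code).
# Assumption: geometric tokens from an RGB-only spatial foundation model
# (optionally fused upstream with depth/pose) are injected via cross-attention
# inside the action head, leaving the VLM backbone and its vision-language
# alignment untouched.
import torch
import torch.nn as nn


class SpatialEnhancedActionHead(nn.Module):
    def __init__(self, d_model=512, n_heads=8, spatial_dim=1024, action_dim=7):
        super().__init__()
        # Project spatial-foundation-model features to the head's width.
        self.spatial_proj = nn.Linear(spatial_dim, d_model)
        # Cross-attention: action queries from the VLM attend to spatial tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.to_action = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )

    def forward(self, vlm_tokens, spatial_tokens):
        # vlm_tokens:     (B, T, d_model)     readout tokens from the VLA backbone
        # spatial_tokens: (B, S, spatial_dim) geometric tokens from RGB frames
        kv = self.spatial_proj(spatial_tokens)
        attended, _ = self.cross_attn(query=vlm_tokens, key=kv, value=kv)
        # Residual path keeps the backbone's language reasoning signal intact.
        fused = self.norm(vlm_tokens + attended)
        return self.to_action(fused.mean(dim=1))  # (B, action_dim) action output
```

Under this sketch, the spatial tokens enter only through the head, so swapping or enriching the upstream spatial model (RGB-only versus depth- or pose-fused) would not require retraining the backbone, which is the modality-transferability property the abstract claims.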