Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

December 15, 2025
Authors: Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu
cs.AI

Abstract

Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.
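
To make the dual-encoder idea concrete, here is a minimal sketch of how a semantic 2D encoder and a 3D-aware encoder might be fused into a single visual token stream before the VLA backbone. This is not the paper's implementation: the module names, feature dimensions, and the concatenate-then-project fusion are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): a dual-encoder visual
# front-end in the spirit of VIPA-VLA, fusing semantic and 3D-aware features.
import torch
import torch.nn as nn


class DualVisualEncoder(nn.Module):
    def __init__(self, sem_dim=768, spatial_dim=256, out_dim=1024):
        super().__init__()
        # Stand-ins for a pretrained VLM vision tower and a 3D visual encoder
        # (e.g., one supervised with 3D annotations extracted from human videos).
        self.semantic_encoder = nn.Sequential(nn.Conv2d(3, sem_dim, 16, 16), nn.Flatten(2))
        self.spatial_encoder = nn.Sequential(nn.Conv2d(3, spatial_dim, 16, 16), nn.Flatten(2))
        # Fuse per-patch semantic and 3D-aware features into one token stream.
        self.fuse = nn.Linear(sem_dim + spatial_dim, out_dim)

    def forward(self, rgb):
        sem = self.semantic_encoder(rgb).transpose(1, 2)   # (B, N, sem_dim)
        spa = self.spatial_encoder(rgb).transpose(1, 2)    # (B, N, spatial_dim)
        tokens = self.fuse(torch.cat([sem, spa], dim=-1))  # (B, N, out_dim)
        return tokens  # visual tokens passed to the language/action backbone


# Usage: a 224x224 RGB frame yields 14x14 = 196 fused visual tokens.
tokens = DualVisualEncoder()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 1024])
```

The key design point the sketch illustrates is that 3D-aware features are added alongside, rather than in place of, the semantic representation, so the backbone retains the original vision-language grounding while gaining spatial cues.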