VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
March 23, 2026
Authors: Zixuan Wang, Yuxin Chen, Yuqi Liu, Jinhui Ye, Pengguang Chen, Changsheng Lu, Shu Liu, Jiaya Jia
cs.AI
Abstract
Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning from low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies the relevant target objects and goal locations. These spatial anchors are then overlaid directly onto the visual observations as structured visual prompts, such as crosshairs and bounding boxes. Guided by these prompts, and trained with a novel auxiliary visual grounding objective, a "System 1 Controller" reliably generates precise low-level motions. Experiments on the Robocasa-GR1-Tabletop benchmark and the SimplerEnv simulation environment demonstrate that VP-VLA improves task success rates by 5% and 8.3%, respectively, surpassing competitive baselines including QwenOFT and GR00T-N1.6.
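To make the visual prompting interface concrete, the following minimal sketch (not taken from the paper) shows what overlaying a bounding box around a target object and a crosshair at a goal location onto an RGB observation could look like. The function name overlay_visual_prompts, the color choices, and the prompt geometry are assumptions made purely for illustration and are not claimed to match the authors' implementation.

import numpy as np
import cv2  # assumed drawing dependency; the abstract does not specify tooling

def overlay_visual_prompts(obs_rgb, target_box, goal_xy):
    """Return a copy of the observation with a bounding box around the target
    object and a crosshair at the goal location (hypothetical prompt layout)."""
    prompted = obs_rgb.copy()
    x1, y1, x2, y2 = target_box
    # Bounding box marking the target object identified by the planner (green).
    cv2.rectangle(prompted, (x1, y1), (x2, y2), color=(0, 255, 0), thickness=2)
    # Crosshair marking the goal location (red).
    cv2.drawMarker(prompted, goal_xy, color=(0, 0, 255),
                   markerType=cv2.MARKER_CROSS, markerSize=20, thickness=2)
    return prompted

# Example: a 224x224 dummy observation with a hypothetical detection and goal point.
obs = np.zeros((224, 224, 3), dtype=np.uint8)
prompted_obs = overlay_visual_prompts(obs, target_box=(60, 80, 120, 150), goal_xy=(180, 40))

In this sketch, the prompted observation would then be passed to the low-level controller in place of the raw frame; how the planner produces the box and point, and how the grounding objective is formulated, are detailed in the paper rather than here.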