VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
March 23, 2026
Authors: Zixuan Wang, Yuxin Chen, Yuqi Liu, Jinhui Ye, Pengguang Chen, Changsheng Lu, Shu Liu, Jiaya Jia
cs.AI
Abstract
Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning from low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies relevant target objects and goal locations. These spatial anchors are then overlaid directly onto visual observations as structured visual prompts, such as crosshairs and bounding boxes. Guided by these prompts and enhanced by a novel auxiliary visual grounding objective during training, a "System 1 Controller" reliably generates precise low-level execution motions. Experiments on the Robocasa-GR1-Tabletop benchmark and in SimplerEnv simulation demonstrate that VP-VLA improves success rates by 5% and 8.3%, respectively, surpassing competitive baselines including QwenOFT and GR00T-N1.6.
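To make the visual-prompt interface concrete, here is a minimal sketch of how the planner's spatial anchors might be rendered onto an observation before it reaches the controller. The abstract does not specify rendering details, so the function name `overlay_visual_prompts`, the marker sizes, and the colors are illustrative assumptions; the sketch uses standard OpenCV drawing primitives.

```python
import numpy as np
import cv2  # drawing style is an assumption, not the paper's exact scheme

def overlay_visual_prompts(obs, target_xy=None, goal_box=None):
    """Render structured visual prompts onto an RGB observation.

    target_xy: (x, y) pixel location of the target object -> crosshair.
    goal_box:  (x0, y0, x1, y1) goal region -> bounding box.
    """
    img = obs.copy()
    if target_xy is not None:
        x, y = target_xy
        # Crosshair marking the target object identified by the planner.
        cv2.line(img, (x - 12, y), (x + 12, y), (0, 255, 0), 2)
        cv2.line(img, (x, y - 12), (x, y + 12), (0, 255, 0), 2)
    if goal_box is not None:
        # Bounding box marking the goal location for the current sub-task.
        x0, y0, x1, y1 = goal_box
        cv2.rectangle(img, (x0, y0), (x1, y1), (255, 0, 0), 2)
    return img

# Example: a blank 224x224 observation with both prompt types overlaid.
obs = np.zeros((224, 224, 3), dtype=np.uint8)
prompted = overlay_visual_prompts(obs, target_xy=(112, 96),
                                  goal_box=(150, 150, 200, 200))
```

The auxiliary visual grounding objective is likewise only named in the abstract. One plausible reading is a secondary head on the controller that regresses the prompted coordinates, trained jointly with the action loss; the PyTorch sketch below makes that assumption explicit. The head shapes, loss forms, and the weight `lam` are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class System1Controller(nn.Module):
    """Controller with an action head plus an assumed grounding head."""
    def __init__(self, feat_dim=512, action_dim=7):
        super().__init__()
        self.action_head = nn.Linear(feat_dim, action_dim)
        self.ground_head = nn.Linear(feat_dim, 2)  # predicts prompt (x, y)

    def forward(self, feats):
        return self.action_head(feats), self.ground_head(feats)

def joint_loss(pred_action, gt_action, pred_xy, prompt_xy, lam=0.1):
    # Imitation loss plus an auxiliary grounding term that forces the
    # controller to localize the visual prompt it is conditioned on.
    l_action = F.mse_loss(pred_action, gt_action)
    l_ground = F.mse_loss(pred_xy, prompt_xy)
    return l_action + lam * l_ground  # lam is an assumed trade-off weight
```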