VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models typically map visual observations and language instructions directly to robot control signals. This "black-box" mapping forces a single forward pass to handle instruction interpretation, spatial grounding, and low-level control simultaneously, which often leads to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning from low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies the relevant target objects and goal locations. These spatial anchors are then overlaid directly onto the visual observations as structured visual prompts, such as crosshairs and bounding boxes. Guided by these prompts and enhanced by a novel auxiliary visual grounding objective during training, a "System 1 Controller" reliably generates precise low-level execution motions. Experiments on the Robocasa-GR1-Tabletop benchmark and in SimplerEnv simulation show that VP-VLA improves success rates by 5% and 8.3%, respectively, surpassing competitive baselines including QwenOFT and GR00T-N1.6.
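The overlay step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names, colors, marker sizes, and the choice of a crosshair for the target object and a bounding box for the goal location are all assumptions made for illustration. The image is modeled as a nested list of RGB tuples so that the example runs without any imaging library.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels, indexed as image[y][x]


def blank_image(width: int, height: int, color: Pixel = (0, 0, 0)) -> Image:
    """Create a solid-color stand-in for a camera observation."""
    return [[color for _ in range(width)] for _ in range(height)]


def _set_pixel(image: Image, x: int, y: int, color: Pixel) -> None:
    """Write one pixel, silently skipping coordinates outside the frame."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        image[y][x] = color


def draw_crosshair(image: Image, cx: int, cy: int,
                   arm: int = 3, color: Pixel = (255, 0, 0)) -> None:
    """Draw a plus-shaped marker centered on (cx, cy)."""
    for d in range(-arm, arm + 1):
        _set_pixel(image, cx + d, cy, color)  # horizontal arm
        _set_pixel(image, cx, cy + d, color)  # vertical arm


def draw_box(image: Image, x0: int, y0: int, x1: int, y1: int,
             color: Pixel = (0, 255, 0)) -> None:
    """Draw a one-pixel rectangle outline with inclusive corners."""
    for x in range(x0, x1 + 1):
        _set_pixel(image, x, y0, color)
        _set_pixel(image, x, y1, color)
    for y in range(y0, y1 + 1):
        _set_pixel(image, x0, y, color)
        _set_pixel(image, x1, y, color)


def overlay_prompts(image: Image,
                    target_xy: Tuple[int, int],
                    goal_box: Tuple[int, int, int, int]) -> Image:
    """Return a copy of the observation with both prompts drawn on it.

    Assumption: the planner has already produced pixel coordinates for
    the target object (target_xy) and the goal region (goal_box).
    """
    prompted = [row[:] for row in image]  # leave the raw observation intact
    draw_crosshair(prompted, *target_xy)
    draw_box(prompted, *goal_box)
    return prompted


frame = blank_image(16, 12)
prompted = overlay_prompts(frame, target_xy=(4, 5), goal_box=(8, 2, 14, 9))
```

In this sketch the prompted frame, rather than the raw one, is what a downstream controller would receive, so the spatial anchors travel to the policy through pixels instead of extra text tokens.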
