IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
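
The inter-chunk conflict described above can be illustrated with a toy sketch. This is not the IntentVLA architecture. It is a minimal 1D simulation under assumed conditions: the current observation is fully aliased (it never reveals the intended direction), an "intent" is just a direction in {-1, +1}, and the "history encoder" is a hand-written rule that reads the sign of the most recent displacement. All names (`CHUNK`, `frame_conditioned_chunk`, `history_conditioned_chunk`, `rollout`, `count_intent_flips`) are hypothetical and chosen for illustration only.

```python
import random

CHUNK = 4  # actions per chunk (hypothetical value)


def frame_conditioned_chunk(position, rng):
    """Toy frame-conditioned policy.

    The current observation (position) is assumed to be aliased: it carries
    no information about the intended direction, so it is ignored and an
    intent is sampled afresh at every replanning step.
    """
    direction = rng.choice([-1, 1])
    return [direction] * CHUNK


def history_conditioned_chunk(history, rng):
    """Toy history-conditioned policy.

    A hand-written stand-in for an intent encoder: the sign of the latest
    displacement is taken as the short-horizon intent and the chunk is
    conditioned on it.
    """
    if len(history) >= 2 and history[-1] != history[-2]:
        direction = 1 if history[-1] > history[-2] else -1
    else:
        # No motion observed yet, so the intent is genuinely ambiguous.
        direction = rng.choice([-1, 1])
    return [direction] * CHUNK


def rollout(policy_name, num_chunks, seed):
    """Execute num_chunks chunks; return visited positions and per-chunk intents."""
    rng = random.Random(seed)
    history = [0]
    chunk_directions = []
    for _ in range(num_chunks):
        if policy_name == "frame":
            chunk = frame_conditioned_chunk(history[-1], rng)
        else:
            chunk = history_conditioned_chunk(history, rng)
        chunk_directions.append(chunk[0])
        for action in chunk:
            history.append(history[-1] + action)
    return history, chunk_directions


def count_intent_flips(chunk_directions):
    """Number of adjacent chunk pairs whose intents disagree."""
    return sum(1 for a, b in zip(chunk_directions, chunk_directions[1:]) if a != b)


if __name__ == "__main__":
    for name in ("frame", "history"):
        positions, directions = rollout(name, num_chunks=8, seed=0)
        print(name, "intent flips:", count_intent_flips(directions),
              "final position:", positions[-1])
```

In this toy setting the history-conditioned rollout commits to whichever direction it sampled first and never flips, while the frame-conditioned rollout can reverse direction at any replanning boundary. The sketch only shows why conditioning on recent context can remove resampling conflicts; it says nothing about how IntentVLA learns or represents intent.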