IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

May 14, 2026
Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen
cs.AI

Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
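
The inter-chunk conflict described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two "policies", the `left`/`right` intents, and every function name below are hypothetical stand-ins chosen only to show why resampling an intent from an aliased frame can flip between replanning steps, while conditioning on recent history keeps it stable.

```python
import random

INTENTS = ["left", "right"]  # hypothetical intents that look identical in a single frame


def frame_conditioned_intent(rng):
    # The current frame is aliased, so this toy policy has no way to tell
    # the intents apart and samples one afresh at every replanning step.
    return rng.choice(INTENTS)


def history_conditioned_intent(history, rng):
    # Toy stand-in for an intent encoder: reuse the intent implied by the
    # most recent chunk, and sample only when no history exists yet.
    return history[-1] if history else rng.choice(INTENTS)


def count_intent_switches(policy, steps, seed):
    # Roll out `steps` replanning steps and count how often the chosen
    # intent differs from the one used for the previous chunk.
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        history.append(policy(history, rng))
    return sum(a != b for a, b in zip(history, history[1:]))


frame_switches = count_intent_switches(
    lambda history, rng: frame_conditioned_intent(rng), steps=20, seed=0
)
history_switches = count_intent_switches(history_conditioned_intent, steps=20, seed=0)
print("frame-conditioned switches:", frame_switches)
print("history-conditioned switches:", history_switches)
```

In this toy setting the history-conditioned rollout never switches intent by construction, whereas the frame-conditioned rollout switches whenever two consecutive samples disagree. A learned intent representation would of course be far richer than copying the previous chunk's label.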