IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned vision-language-action (VLA) policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
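
The idea of conditioning chunk generation on a compact summary of recent observations can be illustrated with a toy sketch. This is not the IntentVLA architecture: the mean-pooling encoder, the linear chunk head, the dimensions, and every name below (`encode_intent`, `generate_chunk`, `proj`, `w_obs`, `w_intent`) are hypothetical choices made only for illustration. The sketch shows that two rollouts sharing the same current frame but differing in recent history receive different intent vectors, and therefore different action chunks, whereas a policy that sees only the current frame could not tell them apart.

```python
import numpy as np

def encode_intent(history, proj):
    """Pool recent frame features of shape (T, D) into an intent vector (K,).

    Temporal mean pooling followed by a linear projection is an assumption
    made for this sketch, not the encoder described in the paper.
    """
    pooled = history.mean(axis=0)
    return np.tanh(proj @ pooled)

def generate_chunk(obs, intent, w_obs, w_intent, chunk_len, action_dim):
    """Map current-frame features and an intent vector to a chunk (H, A)."""
    flat = w_obs @ obs + w_intent @ intent
    return flat.reshape(chunk_len, action_dim)

rng = np.random.default_rng(0)
D, K, H, A = 8, 4, 5, 2  # feature dim, intent dim, chunk length, action dim

# Randomly initialised toy weights (hypothetical, untrained).
proj = rng.normal(size=(K, D))
w_obs = rng.normal(size=(H * A, D))
w_intent = rng.normal(size=(H * A, K))

# One aliased current frame reached through two different recent histories.
obs = rng.normal(size=D)
history_a = np.stack([rng.normal(size=D), rng.normal(size=D), obs])
history_b = np.stack([rng.normal(size=D), rng.normal(size=D), obs])

intent_a = encode_intent(history_a, proj)
intent_b = encode_intent(history_b, proj)

chunk_a = generate_chunk(obs, intent_a, w_obs, w_intent, H, A)
chunk_b = generate_chunk(obs, intent_b, w_obs, w_intent, H, A)

# Same current frame, different history: the chunks are disambiguated.
print(np.allclose(chunk_a, chunk_b))
```

Because the intent vector is a deterministic function of the history in this sketch, replanning from the same history reproduces the same chunk. That is a simplified stand-in for the cross-chunk consistency the abstract describes.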