IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
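
The inter-chunk conflict described above can be illustrated with a toy sketch. This is not the IntentVLA architecture: it assumes a one-dimensional task in which every observation is aliased between two equally valid modes (go left, -1, or go right, +1), and all names (`frame_conditioned_chunk`, `history_conditioned_chunk`, `rollout`, `count_conflicts`) are hypothetical. The frame-conditioned policy resamples the mode at every replanning step, while the history-conditioned policy reads a crude "intent" from the sign of its recent chunks.

```python
import random


def frame_conditioned_chunk(rng):
    # The current observation is aliased, so both modes look equally valid
    # and the mode is resampled independently at every replanning step.
    return rng.choice([-1, 1])


def history_conditioned_chunk(history, rng):
    # Toy stand-in for an intent representation: the sign of the net
    # direction over the recent chunks. With no usable history, sample.
    net = sum(history)
    if net == 0:
        return rng.choice([-1, 1])
    return 1 if net > 0 else -1


def count_conflicts(directions):
    # An inter-chunk conflict here means adjacent chunks disagree on direction.
    return sum(1 for a, b in zip(directions, directions[1:]) if a != b)


def rollout(policy, steps, seed, window=4):
    # Run `steps` replanning steps and return the direction of each chunk.
    rng = random.Random(seed)
    directions = []
    for _ in range(steps):
        if policy == "frame":
            d = frame_conditioned_chunk(rng)
        else:
            d = history_conditioned_chunk(directions[-window:], rng)
        directions.append(d)
    return directions


if __name__ == "__main__":
    for policy in ("frame", "history"):
        total = sum(count_conflicts(rollout(policy, 20, s)) for s in range(50))
        print(policy, "total conflicts over 50 rollouts:", total)
```

In this sketch the history-conditioned rollout commits to whichever mode its first chunk sampled and never flips, whereas the frame-conditioned rollout switches direction whenever consecutive samples disagree. A learned intent encoder over recent visual observations plays a far richer role than this sign heuristic; the example only shows why conditioning on recent context can remove resampling conflicts.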