Do What? Teaching Vision-Language-Action Models to Reject the Impossible

Abstract

Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, with language instructions playing a crucial role not only in predicting actions but also in robustly interpreting user intent, even when a request is impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. To this end, we construct a large-scale instruction tuning setup with structured language prompts and train a VLA model capable of handling both accurate and erroneous requests. Our approach leverages a contextually augmented, semi-synthetic dataset containing paired positive and false-premise instructions, enabling robust detection and natural language correction. Our experiments show that IVA improves false-premise detection accuracy by 97.56% over baselines, while increasing successful responses in false-premise scenarios by 50.78%.
