

Do What? Teaching Vision-Language-Action Models to Reject the Impossible

August 22, 2025
作者: Wen-Han Hsieh, Elvis Hsieh, Dantong Niu, Trevor Darrell, Roei Herzig, David M. Chan
cs.AI

Abstract

Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, with language instructions playing a crucial role -- not only in predicting actions, but also in robustly interpreting user intent, even when the requests are impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. Towards this end, we construct a large-scale instruction tuning setup with structured language prompts and train a VLA model capable of handling both accurate and erroneous requests. Our approach leverages a contextually augmented, semi-synthetic dataset containing paired positive and false-premise instructions, enabling robust detection and natural language correction. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines, while increasing successful responses in false-premise scenarios by 50.78%.
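As a rough illustration of the detect-clarify-act behavior described above, the sketch below shows what one instruct-verify-and-act cycle might look like at inference time. This is not the authors' released code: names such as `vla_model`, `detect_false_premise`, `propose_alternative`, and `predict_action` are hypothetical placeholders standing in for whatever interfaces an IVA-style system would expose.

```python
# Hypothetical sketch of an IVA-style inference loop (not the authors' implementation).
# All model methods used here are illustrative placeholders, not a real API.

from dataclasses import dataclass


@dataclass
class StepResult:
    executed: bool
    message: str


def iva_step(vla_model, observation, instruction) -> StepResult:
    """Run one instruct-verify-and-act cycle.

    1. Verify: check whether the instruction's premise holds in the
       current observation (e.g. the referenced object actually exists).
    2. If the premise is false, reply in language with a correction and
       a grounded alternative instead of acting.
    3. Otherwise, predict and execute the action as usual.
    """
    premise_ok, explanation = vla_model.detect_false_premise(observation, instruction)

    if not premise_ok:
        # Language-based clarification / correction path.
        alternative = vla_model.propose_alternative(observation, instruction)
        return StepResult(
            executed=False,
            message=f"Cannot comply: {explanation}. Suggested alternative: {alternative}",
        )

    # Standard VLA action-prediction path.
    action = vla_model.predict_action(observation, instruction)
    vla_model.execute(action)
    return StepResult(executed=True, message="Instruction executed.")
```

The split into a verification step followed by either a language response or an action mirrors the paper's framing, but the exact interfaces, prompts, and outputs of the actual IVA model may differ.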