Do What? Teaching Vision-Language-Action Models to Reject the Impossible

Abstract

Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, and language instructions play a crucial role, not only in predicting actions but also in robustly interpreting user intent, even when a request is impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. To this end, we construct a large-scale instruction tuning setup with structured language prompts and train a VLA model capable of handling both accurate and erroneous requests. Our approach leverages a contextually augmented, semi-synthetic dataset containing paired positive and false-premise instructions, enabling robust detection and natural language correction. Our experiments show that IVA improves false-premise detection accuracy by 97.56% over baselines, while increasing successful responses in false-premise scenarios by 50.78%.
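
The three steps named in the abstract (detect, clarify, ground an alternative) can be illustrated with a toy sketch. This is not the authors' implementation: IVA is a trained VLA model that works from images and language, whereas the sketch below assumes a hypothetical list of scene object names is already available and uses plain string similarity as a crude stand-in for perceptual grounding. The names `Verdict` and `verify_instruction` are invented for illustration.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import List, Optional


@dataclass
class Verdict:
    executable: bool        # True when the referenced object is in the scene
    response: str           # Language reply that would be shown to the user
    target: Optional[str]   # Object to act on, or a suggested alternative


def verify_instruction(referenced: str, scene_objects: List[str]) -> Verdict:
    """Toy false-premise check (assumption: object names are given as strings)."""
    # (i) Detect: the premise holds only if the referenced object is present.
    if referenced in scene_objects:
        return Verdict(True, f"Executing the task with the {referenced}.", referenced)

    # (iii) Ground an alternative: pick the most similar object name.
    # String similarity is only a placeholder for perception-based grounding.
    close = get_close_matches(referenced, scene_objects, n=1, cutoff=0.0)
    if close:
        alternative = close[0]
        # (ii) Clarify: tell the user what is wrong and offer the alternative.
        reply = f"I cannot see a {referenced} here. Did you mean the {alternative}?"
        return Verdict(False, reply, alternative)

    # Empty scene: nothing to suggest, so only report the false premise.
    return Verdict(False, f"I cannot see a {referenced} here.", None)


if __name__ == "__main__":
    scene = ["red mug", "blue block", "sponge"]
    print(verify_instruction("red cup", scene).response)
    print(verify_instruction("sponge", scene).response)
```

Returning the reply and the candidate target together mirrors the abstract's framing of a single framework that both responds in language and proposes an action, rather than a separate rejection filter placed in front of the policy.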
