Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Abstract

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recent advances in Vision-Language-Action models (VLAs) have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios, where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue, then generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a Vision-Language Model (VLM) for collaboration and a diffusion model for action. We also introduce a connection module that generates conditions for the diffusion model based on the output of the VLM. This module adjusts the observation according to the instruction to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component on ambiguity-solving dialogue data so that it can handle ambiguity. Then, we integrate the action component while keeping the collaboration component frozen. This preserves the interaction abilities while the diffusion model is fine-tuned to generate actions. The training strategy ensures that our framework can first ask questions and then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework on eight real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, together with the training strategy, provides a path toward collaborative embodied agents.
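
The ask-then-act routing described in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the framework's actual implementation: the abstract does not specify how the signal detector recognizes a signal, so the marker strings `<ASK>` and `<ACT>`, the function names, and the toy stand-ins for the VLM and the action component are all hypothetical. In the real framework the action component is a diffusion model producing low-level actions, whereas the toy version here simply returns a string.

```python
from typing import Callable, List

# Hypothetical marker tokens; the abstract does not name the actual signals.
ASK_SIGNAL = "<ASK>"
ACT_SIGNAL = "<ACT>"


def detect_signal(vlm_output: str) -> str:
    """Toy signal detector: route to 'act' if the act marker is present, else 'ask'."""
    return "act" if ACT_SIGNAL in vlm_output else "ask"


def run_episode(
    instruction: str,
    vlm: Callable[[List[str]], str],
    ask_user: Callable[[str], str],
    act: Callable[[str, List[str]], str],
    max_turns: int = 5,
) -> str:
    """Ask clarifying questions until the detector routes to the action component."""
    dialogue = [instruction]
    for _ in range(max_turns):
        output = vlm(dialogue)
        if detect_signal(output) == "act":
            # Ambiguity resolved: hand over to the action component.
            return act(output, dialogue)
        # Otherwise the output is a question; record it with the user's answer.
        answer = ask_user(output)
        dialogue += [output, answer]
    raise RuntimeError("instruction still ambiguous after max_turns")


def toy_vlm(dialogue: List[str]) -> str:
    """Stand-in VLM: asks once, then signals action once a colour is known."""
    if any("red" in turn for turn in dialogue):
        return f"{ACT_SIGNAL} pick up the red cup"
    return f"{ASK_SIGNAL} Which cup do you mean?"


def toy_action(output: str, dialogue: List[str]) -> str:
    """Stand-in action component: returns the resolved instruction as text."""
    return output.replace(ACT_SIGNAL, "").strip()


result = run_episode("Pick up the cup", toy_vlm, lambda q: "the red one", toy_action)
print(result)
```

In this toy run the ambiguous instruction "Pick up the cup" triggers one clarifying question, after which the router switches to the action branch. The `max_turns` cap is a design choice of this sketch, not something stated in the abstract.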
