Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Abstract

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recent advances in vision-language-action models (VLAs) offer a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without any feedback mechanism. This approach often fails in real-world scenarios, where instructions are frequently ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. The framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue, then generates low-level actions end-to-end. Specifically, it consists of two components: a vision-language model (VLM) for collaboration and a diffusion model for action generation. We also introduce a connection module that generates conditions for the diffusion model from the output of the VLM; it adjusts the observation according to the instruction to create reliable conditions. We train the framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component on ambiguity-solving dialogue data so that it can handle ambiguity. Then, we integrate the action component while freezing the collaboration component, which preserves the interaction abilities while the diffusion model is fine-tuned to generate actions. This training strategy ensures that the framework asks questions first and generates actions afterward. During inference, a signal detector acts as a router that helps the framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework on 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that the proposed framework, together with the training strategy, provides a path toward collaborative embodied agents.
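
The ask-then-act routing described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the signal detector recognizes that the dialogue is finished, so the `<ACT>` marker, the function names, and the toy stand-ins for the VLM, the action component, and the human are all assumptions made here for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical marker: the abstract does not specify what signal the detector reads.
ACT_MARKER = "<ACT>"


def detect_signal(vlm_output: str) -> str:
    """Return "act" if the output carries the assumed action marker, else "ask"."""
    return "act" if vlm_output.startswith(ACT_MARKER) else "ask"


def run_episode(
    vlm: Callable[[List[str]], str],
    action_head: Callable[[str], List[float]],
    human: Callable[[str], str],
    instruction: str,
    max_turns: int = 5,
) -> Tuple[List[str], Optional[List[float]]]:
    """Ask clarifying questions until the detector routes to the action component."""
    dialogue = [instruction]
    for _ in range(max_turns):
        output = vlm(dialogue)
        if detect_signal(output) == "act":
            # Ambiguity resolved: hand the VLM output to the action component.
            return dialogue, action_head(output)
        # Otherwise the output is a question: record it and the human's answer.
        dialogue.append(output)
        dialogue.append(human(output))
    return dialogue, None  # no action was produced within the turn limit


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_vlm(dialogue: List[str]) -> str:
    if len(dialogue) == 1:
        return "Which cup do you mean, the red one or the blue one?"
    return f"{ACT_MARKER} pick up the {dialogue[-1]} cup"


def toy_action_head(condition: str) -> List[float]:
    return [0.0, 0.0, 0.0]  # placeholder low-level action


def toy_human(question: str) -> str:
    return "red"


dialogue, actions = run_episode(toy_vlm, toy_action_head, toy_human, "Pick up the cup.")
```

In this toy run the first VLM output is a question, so the loop collects the answer; the second output carries the marker, so control passes to the action stand-in. A real system would replace the three stand-ins with the fine-tuned VLM, the diffusion-based action component, and actual human input.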
