Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Abstract

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recent advances in Vision-Language-Action models (VLAs) have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios, where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue, then generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a VLM for collaboration and a diffusion model for action. We also introduce a connection module that generates conditions for the diffusion model based on the output of the VLM. This module adjusts the observation according to the instruction to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component on ambiguity-solving dialogue data so that it learns to handle ambiguity. Then, we integrate the action component while keeping the collaboration component frozen. This preserves the interaction abilities while the diffusion model is fine-tuned to generate actions. The training strategy ensures that our framework can first ask questions and then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework on 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.
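
The inference-time routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the signal detector works, so the marker strings `<ASK>` and `<ACT>`, the `DialogueState` type, and the `vlm`, `ask_human`, and `act` callables are all hypothetical stand-ins for the collaboration component, the human partner, and the diffusion-based action component.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical routing markers; the abstract does not name the actual signals.
ASK_SIGNAL = "<ASK>"
ACT_SIGNAL = "<ACT>"


@dataclass
class DialogueState:
    """The original instruction plus the question/answer turns so far."""
    instruction: str
    history: List[str] = field(default_factory=list)


def detect_signal(vlm_output: str) -> str:
    """Return "ask" or "act" depending on which marker prefixes the output."""
    if vlm_output.startswith(ACT_SIGNAL):
        return "act"
    if vlm_output.startswith(ASK_SIGNAL):
        return "ask"
    raise ValueError(f"no routing signal in VLM output: {vlm_output!r}")


def run_episode(
    instruction: str,
    vlm: Callable[[DialogueState], str],
    ask_human: Callable[[str], str],
    act: Callable[[str], list],
    max_turns: int = 5,
) -> list:
    """Ask clarifying questions until the VLM signals that it is ready to act."""
    state = DialogueState(instruction)
    for _ in range(max_turns):
        output = vlm(state)
        if detect_signal(output) == "act":
            # Hand the clarified instruction to the action component.
            return act(output[len(ACT_SIGNAL):].strip())
        # Otherwise pose the question and record the exchange.
        question = output[len(ASK_SIGNAL):].strip()
        answer = ask_human(question)
        state.history.extend([question, answer])
    raise RuntimeError("instruction still ambiguous after max_turns")
```

In this sketch the router is a plain prefix check; the framework's actual detector and the way the connection module conditions the diffusion model may differ substantially.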
