Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Abstract

The ultimate goal of embodied agents is to create collaborators that can interact with humans, rather than mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recent advances in Vision-Language-Action models (VLAs) offer a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without seeking feedback. This approach fails in real-world scenarios, where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue, and then generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a Vision-Language Model (VLM) for collaboration and a diffusion model for action generation. We also introduce a connection module that generates conditions for the diffusion model from the output of the VLM. This module adjusts the observation according to the instruction to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component on ambiguity-solving dialogue data so that it learns to handle ambiguity. Then, we integrate the action component while keeping the collaboration component frozen, which preserves the interaction abilities while the diffusion model is fine-tuned to generate actions. This training strategy ensures that our framework first asks questions and then generates actions. During inference, a signal detector acts as a router that helps the framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework on 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that the proposed framework, together with the training strategy, provides a path toward collaborative embodied agents.
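
To make the ask-then-act routing at inference time concrete, the following Python sketch shows one way such a loop could be organized. It is an illustration under stated assumptions, not the implementation used in the paper: the marker strings `<ASK>` and `<ACT>`, the functions `detect_signal`, `strip_signal`, and `run_episode`, and the toy stand-ins `toy_vlm`, `toy_policy`, and `toy_user` are all hypothetical names introduced here. The abstract does not specify how the signal is encoded or what the action format is.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical marker strings; the abstract does not say how the signal is encoded.
ASK_SIGNAL = "<ASK>"
ACT_SIGNAL = "<ACT>"


@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str


def detect_signal(vlm_output: str) -> str:
    """Toy signal detector: route to 'act' only when the act marker is present."""
    return "act" if vlm_output.startswith(ACT_SIGNAL) else "ask"


def strip_signal(vlm_output: str) -> str:
    """Remove a leading marker, if any, and surrounding whitespace."""
    for marker in (ASK_SIGNAL, ACT_SIGNAL):
        if vlm_output.startswith(marker):
            return vlm_output[len(marker):].strip()
    return vlm_output.strip()


def run_episode(
    instruction: str,
    collaborate: Callable[[List[Turn]], str],           # stands in for the VLM
    generate_actions: Callable[[str], List[List[float]]],  # stands in for the diffusion model
    answer: Callable[[str], str],                       # stands in for the human
    max_turns: int = 5,
) -> Tuple[List[Turn], List[List[float]]]:
    """Ask clarifying questions until the detector routes to action generation."""
    dialogue = [Turn("user", instruction)]
    for _ in range(max_turns):
        output = collaborate(dialogue)
        if detect_signal(output) == "act":
            # Instruction is considered resolved: hand it to the action component.
            return dialogue, generate_actions(strip_signal(output))
        question = strip_signal(output)
        dialogue.append(Turn("agent", question))
        dialogue.append(Turn("user", answer(question)))
    return dialogue, []  # still ambiguous after max_turns: no actions produced


def toy_vlm(dialogue: List[Turn]) -> str:
    """Toy collaborator: acts once any user turn names a cup color."""
    user_text = " ".join(t.text for t in dialogue if t.speaker == "user")
    for color in ("red", "blue"):
        if color in user_text:
            return f"{ACT_SIGNAL} pick up the {color} cup"
    return f"{ASK_SIGNAL} Which cup do you mean, the red one or the blue one?"


def toy_policy(resolved_instruction: str) -> List[List[float]]:
    """Placeholder action generator: 4 zero vectors of length 7 (arbitrary sizes)."""
    return [[0.0] * 7 for _ in range(4)]


def toy_user(question: str) -> str:
    """Scripted human who always picks the red cup."""
    return "the red one"


if __name__ == "__main__":
    dialogue, actions = run_episode("pick up the cup", toy_vlm, toy_policy, toy_user)
    for turn in dialogue:
        print(f"{turn.speaker}: {turn.text}")
    print(f"generated {len(actions)} action vectors")
```

In this sketch an ambiguous instruction such as "pick up the cup" triggers one clarifying question before any action is produced, while an instruction that already names the cup goes straight to action generation. The `max_turns` cap is an added safeguard of the sketch, not something described in the abstract.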
