VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

June 10, 2025
Authors: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
cs.AI

Abstract

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. Although recent work has leveraged large language models (LLMs) for multi-agent planning, only a few studies have begun to explore vision-language models (VLMs) for visual reasoning, and these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that first fine-tunes a pretrained VLM on Chain-of-Thought annotated demonstrations and then applies reinforcement learning under multi-level reward signals. Extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, vision-driven cooperation in embodied AI systems.