VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Abstract

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent work has leveraged large language models (LLMs) for multi-agent planning, only a few studies have begun to explore vision-language models (VLMs) for visual reasoning. Moreover, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that first fine-tunes a pretrained VLM on Chain-of-Thought annotated demonstrations and then applies reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visually driven cooperation in embodied AI systems.
