VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Abstract

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. Moreover, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that first fine-tunes a pretrained VLM on Chain-of-Thought annotated demonstrations and then applies reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, vision-driven cooperation in embodied AI systems.
