VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Abstract

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent work has leveraged large language models (LLMs) for multi-agent planning, only a few studies have begun to explore vision-language models (VLMs) for visual reasoning, and these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that first fine-tunes a pretrained VLM on Chain-of-Thought annotated demonstrations and then applies reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, vision-driven cooperation in embodied AI systems.
