Orchestra-o1: Omnimodal Agent Orchestration

Abstract

The recent success of agent swarms has shifted large language model (LLM)-based agents from single-agent workflows toward multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks support only a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation is particularly pronounced in omnimodal scenarios, where tasks require unified understanding and coordination of diverse inputs such as text, images, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, and Orchestra-o1 surpasses the second-best approach by 10.3% in accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach that we use to train Orchestra-o1-8B, which achieves state-of-the-art performance among all existing open-source omnimodal agents.