Orchestra-o1: Omnimodal Agent Orchestration
June 10, 2026
Authors: Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng
cs.AI
Abstract
The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.
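The abstract names three orchestration steps: modality-aware task decomposition, sub-agent specialization, and parallel sub-task execution. The sketch below only illustrates that general pattern with Python's asyncio. It is not the Orchestra-o1 implementation: `SubTask`, `decompose`, `run_sub_agent`, and `orchestrate` are hypothetical names, and the sub-agent is a stand-in that simply echoes its input.

```python
import asyncio
from dataclasses import dataclass

# Every name below is a hypothetical illustration, not the paper's API.


@dataclass
class SubTask:
    modality: str  # e.g. "text", "image", "audio", "video"
    payload: str   # reference to the input, e.g. a file name


def decompose(inputs: dict[str, str]) -> list[SubTask]:
    """Assumed decomposition rule: one sub-task per input modality."""
    return [SubTask(modality, payload) for modality, payload in inputs.items()]


async def run_sub_agent(task: SubTask) -> str:
    """Stand-in for a specialized sub-agent; a real one would call a model or tool."""
    await asyncio.sleep(0)  # yield control to mimic asynchronous work
    return f"[{task.modality}] processed {task.payload}"


async def orchestrate(inputs: dict[str, str]) -> list[str]:
    """Decompose the task, then run all sub-tasks concurrently."""
    tasks = decompose(inputs)
    # asyncio.gather returns results in the same order as its arguments.
    return await asyncio.gather(*(run_sub_agent(t) for t in tasks))


if __name__ == "__main__":
    for line in asyncio.run(orchestrate({"text": "question.txt", "audio": "clip.wav"})):
        print(line)
```

Here each modality gets its own sub-task, so the slowest sub-agent sets the wall-clock time, not the sum of all of them. How Orchestra-o1 actually decomposes tasks and specializes sub-agents is not specified in the abstract.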