Orchestra-o1: Omnimodal Agent Orchestration

Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation is particularly pronounced in omnimodal scenarios, where tasks require unified understanding and coordination of diverse inputs such as text, images, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to tackle complex real-world tasks involving heterogeneous information sources. On the OmniGAIA benchmark, Orchestra-o1 surpasses the second-best approach by 10.3% in accuracy. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B. The resulting model achieves state-of-the-art performance among existing open-source omnimodal agents.
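To make the three orchestration steps named above concrete, the following is a minimal sketch of how modality-aware decomposition, on-demand sub-agent specialization, and parallel sub-task execution could fit together. It is not the Orchestra-o1 implementation: the names `SubTask`, `decompose`, `specialize`, `orchestrate`, and `run` are hypothetical, the one-sub-task-per-modality split is an assumed simplification, and each sub-agent is a stub standing in for a real model or tool call.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    """Hypothetical unit of work tied to a single modality."""
    modality: str
    payload: str


def decompose(inputs: dict[str, str]) -> list[SubTask]:
    # Assumed simplification: one sub-task per modality present in the input.
    return [SubTask(modality, payload) for modality, payload in inputs.items()]


def specialize(modality: str):
    # Build a sub-agent for the given modality on demand.
    # The agent body is a stub; a real system would call a model or tool here.
    async def agent(task: SubTask) -> str:
        await asyncio.sleep(0)  # placeholder for asynchronous model/tool I/O
        return f"[{modality}] processed {task.payload}"

    return agent


async def orchestrate(inputs: dict[str, str]) -> list[str]:
    subtasks = decompose(inputs)
    agents = [specialize(task.modality) for task in subtasks]
    # Run all sub-agents concurrently; gather preserves the input order.
    results = await asyncio.gather(
        *(agent(task) for agent, task in zip(agents, subtasks))
    )
    return list(results)


def run(inputs: dict[str, str]) -> list[str]:
    # Synchronous entry point for convenience.
    return asyncio.run(orchestrate(inputs))


if __name__ == "__main__":
    for line in run({"text": "question.txt", "audio": "clip.wav"}):
        print(line)
```

In this sketch the orchestrator only fans work out and collects the results; aggregating sub-agent outputs into a final answer, and any learned decision policy such as the one DA-GRPO trains, are deliberately left out.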