Orchestra-o1: Omnimodal Agent Orchestration

Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks support only a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation is particularly pronounced in omnimodal scenarios, where tasks require unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to tackle complex real-world tasks involving heterogeneous information sources, and it surpasses the second-best approach by 10.3% in accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance among all existing open-source omnimodal agents.
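
To make the three ingredients named above more concrete (modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution), the following is a minimal illustrative sketch. It is not the Orchestra-o1 implementation: the names `SubTask`, `decompose`, `specialize`, `run_sub_agent`, and `orchestrate` are hypothetical, the sub-agent is a stand-in that only formats a string, and the one-sub-task-per-modality rule is an assumption chosen for brevity.

```python
import asyncio
from dataclasses import dataclass

# Modalities listed in the abstract; the fixed ordering is an assumption.
MODALITIES = ("text", "image", "audio", "video")


@dataclass
class SubTask:
    modality: str
    payload: str


def decompose(inputs: dict[str, str]) -> list[SubTask]:
    """Modality-aware decomposition (assumed rule): one sub-task per modality present."""
    return [SubTask(m, inputs[m]) for m in MODALITIES if m in inputs]


def specialize(task: SubTask) -> str:
    """Hypothetical online specialization: pick a role prompt at run time."""
    return f"You are a {task.modality} specialist."


async def run_sub_agent(task: SubTask) -> str:
    """Stand-in sub-agent; a real one would call a model with the role prompt."""
    role = specialize(task)
    await asyncio.sleep(0)  # yield control, simulating asynchronous work
    return f"{role} Input: {task.payload}"


async def orchestrate(inputs: dict[str, str]) -> list[str]:
    """Run all sub-tasks concurrently; gather returns results in task order."""
    tasks = decompose(inputs)
    return list(await asyncio.gather(*(run_sub_agent(t) for t in tasks)))


def run(inputs: dict[str, str]) -> list[str]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(orchestrate(inputs))
```

Calling `run({"audio": "clip.wav", "text": "question"})` would dispatch a text sub-agent and an audio sub-agent concurrently and return their outputs in the order given by `MODALITIES`. A real orchestrator would also need to merge the sub-agent outputs into a final answer, which this sketch leaves out.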