Maestro: Orchestrating Hierarchical Model-Skill Ensembles via Reinforcement Learning

Abstract

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on a single monolithic LLM and fixed logic to invoke these skills. This creates a critical bottleneck: different LLMs have distinct strengths across domains, yet current frameworks fail to exploit the complementary strengths of models and skills, which limits their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose frozen expert models with a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL and requires no step-level supervision. We evaluate Maestro on ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
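To make the sequential decision process concrete, the following is a minimal sketch of an orchestration loop over a model-skill registry. It is not the authors' implementation: every name here (`Registry`, `Action`, `run_episode`, `toy_policy`, `outcome_reward`) is hypothetical, the learned policy is replaced by a hand-written stub, the two-tier skill library is flattened into a single model-to-skill mapping, and the reward is assumed to be exact-match correctness of the final answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; none of these come from the Maestro codebase.
Skill = Callable[[str], str]


@dataclass
class Registry:
    """Hierarchical registry: model name -> {skill name -> callable}."""
    experts: Dict[str, Dict[str, Skill]] = field(default_factory=dict)

    def register(self, model: str, skill: str, fn: Skill) -> None:
        self.experts.setdefault(model, {})[skill] = fn

    def pairs(self) -> List[Tuple[str, str]]:
        """All (model, skill) pairs the policy may choose from."""
        return [(m, s) for m, skills in self.experts.items() for s in skills]


@dataclass
class Action:
    kind: str                      # "call" an expert or "stop" with an answer
    model: Optional[str] = None
    skill: Optional[str] = None
    answer: Optional[str] = None


Policy = Callable[[str, List[str], List[Tuple[str, str]]], Action]


def run_episode(task: str, registry: Registry, policy: Policy,
                max_steps: int = 4) -> Tuple[str, List[str]]:
    """Roll out one episode: the policy picks a pair or terminates at each step."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(task, history, registry.pairs())
        if action.kind == "stop":
            return action.answer or "", history
        expert = registry.experts[action.model][action.skill]
        history.append(f"{action.model}/{action.skill}: {expert(task)}")
    # Step budget exhausted: fall back to the last observation, if any.
    return (history[-1] if history else ""), history


def outcome_reward(answer: str, gold: str) -> float:
    """Outcome-only reward (assumed exact match); no per-step supervision."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def toy_policy(task: str, history: List[str],
               pairs: List[Tuple[str, str]]) -> Action:
    """Stand-in for the learned policy: call the first pair once, then stop."""
    if not history and pairs:
        model, skill = pairs[0]
        return Action("call", model, skill)
    answer = history[-1].split(": ", 1)[1] if history else ""
    return Action("stop", answer=answer)


registry = Registry()
registry.register("math_expert", "add",
                  lambda t: str(sum(int(x) for x in t.split("+"))))

answer, trace = run_episode("2+3", registry, toy_policy)
reward = outcome_reward(answer, "5")
print(answer, trace, reward)
```

In this sketch the policy sees the available pairs only through `registry.pairs()`, so registering an additional expert changes the action set without touching the policy code. This loosely mirrors the abstract's claim about extending the registry without retraining, though whether a learned policy actually generalizes is an empirical question the sketch does not address.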