Maestro: Reinforcement Learning Orchestration of Hierarchical Model-Skill Ensembles

Abstract

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on a single monolithic LLM and fixed logic to interface with these skills. This creates a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, which limits their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a reinforcement learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL and requires no step-level supervision. We evaluate Maestro on ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro also maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
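As a rough illustration of the sequential decision process described above, the following minimal sketch rolls out one episode in which a policy either calls a (model, skill) pair from a registry or terminates with an answer, and a single outcome reward is computed at the end. It is not taken from the Maestro codebase: the names `Registry`, `run_episode`, `outcome_reward`, `toy_policy`, and `toy_calculator`, the action encoding, and the exact-match reward are all assumptions made for illustration, and the two-tier structure of the skill library is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical action encoding (an assumption, not the paper's format):
#   ("call", (model_name, skill_name)) invokes a frozen expert,
#   ("answer", text) terminates the episode.
Action = Tuple[str, object]
Policy = Callable[[List[str]], Action]


@dataclass
class Registry:
    """Hypothetical flat registry of frozen experts keyed by (model, skill)."""
    experts: Dict[Tuple[str, str], Callable[[str], str]] = field(default_factory=dict)

    def register(self, model: str, skill: str, fn: Callable[[str], str]) -> None:
        self.experts[(model, skill)] = fn

    def call(self, model: str, skill: str, query: str) -> str:
        return self.experts[(model, skill)](query)


def run_episode(policy: Policy, registry: Registry, task: str,
                max_steps: int = 4) -> Tuple[str, List[Action]]:
    """Roll out one episode and return (final_answer, trajectory)."""
    context: List[str] = [task]
    trajectory: List[Action] = []
    for _ in range(max_steps):
        action = policy(context)
        trajectory.append(action)
        kind, payload = action
        if kind == "answer":
            # The policy chose to terminate.
            return str(payload), trajectory
        # Otherwise the policy selected a model-skill pair to invoke.
        model, skill = payload
        context.append(registry.call(model, skill, task))
    # Step budget exhausted without an answer.
    return "", trajectory


def outcome_reward(answer: str, gold: str) -> float:
    """Outcome-only reward: 1.0 for an exact match, no per-step signal."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def toy_calculator(query: str) -> str:
    """Toy stand-in for a frozen expert: sums integers separated by '+'."""
    return str(sum(int(x) for x in query.split("+")))


def toy_policy(context: List[str]) -> Action:
    """Rule-based stand-in for the learned policy: call once, then answer."""
    if len(context) == 1:
        return ("call", ("math-expert", "calculator"))
    return ("answer", context[-1])
```

In an actual RL setup the rule-based `toy_policy` would be replaced by a trainable model, and the scalar returned by `outcome_reward` would be the only training signal for the whole trajectory, which is what makes step-level supervision unnecessary.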