Maestro: Reinforcement Learning for Orchestrating Hierarchical Model-Skill Ensembles

Abstract

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on a single monolithic LLM and fixed logic to interface with these skills. This creates a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, which limits their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a reinforcement learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL and requires no step-level supervision. We evaluate Maestro on ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro also maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
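As a rough illustration of the sequential decision process described in the abstract, the sketch below shows one way an orchestration episode could be structured: at each step a policy either calls a model-skill pair from a registry or terminates with an answer, and only the final outcome is rewarded. This is a minimal sketch under stated assumptions, not the authors' implementation. Every name in it (`Action`, `run_episode`, `outcome_reward`, `toy_registry`, `toy_policy`) is hypothetical, the registry is modeled as a simple model-to-skill mapping rather than the paper's two-tier skill library, and the hand-written stub policy stands in for the learned 4B orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry layout: expert model name -> skill name -> callable.
# The real Maestro registry format is not described in the abstract.
Registry = Dict[str, Dict[str, Callable[[str], str]]]


@dataclass(frozen=True)
class Action:
    kind: str                     # "call" an expert, or "answer" to terminate
    model: Optional[str] = None   # set when kind == "call"
    skill: Optional[str] = None   # set when kind == "call"
    answer: Optional[str] = None  # set when kind == "answer"


# A policy maps (task, context so far, registry) to the next action.
Policy = Callable[[str, List[str], Registry], Action]


def run_episode(
    task: str, policy: Policy, registry: Registry, max_steps: int = 4
) -> Tuple[Optional[str], List[Action]]:
    """Roll out one episode and return (final answer, list of actions taken)."""
    context: List[str] = []
    trajectory: List[Action] = []
    for _ in range(max_steps):
        action = policy(task, context, registry)
        trajectory.append(action)
        if action.kind == "answer":
            return action.answer, trajectory
        # Invoke the selected frozen expert and append its output to the context.
        expert = registry[action.model][action.skill]
        context.append(expert(task))
    return None, trajectory  # step budget exhausted without an answer


def outcome_reward(prediction: Optional[str], gold: str) -> float:
    """Outcome-based reward: depends only on the final answer, not on the steps."""
    return 1.0 if prediction == gold else 0.0


def toy_registry() -> Registry:
    # Stub experts for illustration only; they ignore most of the task.
    return {
        "math_expert": {"solve": lambda task: "4"},
        "chart_expert": {"read_axis": lambda task: "unknown"},
    }


def toy_policy(task: str, context: List[str], registry: Registry) -> Action:
    # Hand-written stand-in for a learned policy: call one expert, then answer.
    if not context:
        return Action("call", model="math_expert", skill="solve")
    return Action("answer", answer=context[-1])


if __name__ == "__main__":
    answer, trajectory = run_episode("2 + 2 = ?", toy_policy, toy_registry())
    print(answer, len(trajectory), outcome_reward(answer, "4"))
```

In an RL setup of this kind, the scalar from `outcome_reward` would be the only training signal for the policy, which is what makes step-level supervision unnecessary. Adding a new entry to the registry changes the action space without changing the episode loop.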