Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Abstract

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., they assign each query to a single model in isolation), which limits their ability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
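As a rough illustration of how the three reward components named in the abstract (format, final outcome, and cost) could be combined into a single scalar, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper or its repository: the `<answer>` tag convention, exact-match scoring, the linear cost penalty, the `max_cost` budget, and the `alpha` weight are all hypothetical.

```python
import re

# Hypothetical convention: the router's final answer is wrapped in <answer>...</answer>.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", flags=re.DOTALL)


def format_reward(trajectory: str) -> float:
    """Return 0.0 if exactly one <answer> block is present, otherwise -1.0 (assumed penalty)."""
    return 0.0 if len(ANSWER_PATTERN.findall(trajectory)) == 1 else -1.0


def outcome_reward(trajectory: str, gold: str) -> float:
    """Return 1.0 on a case-insensitive exact match with the gold answer, otherwise 0.0."""
    match = ANSWER_PATTERN.search(trajectory)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0


def cost_reward(total_cost: float, max_cost: float) -> float:
    """Return 1.0 for zero spend, falling linearly to 0.0 at or above the assumed budget."""
    if max_cost <= 0:
        raise ValueError("max_cost must be positive")
    return 1.0 - min(total_cost / max_cost, 1.0)


def total_reward(trajectory: str, gold: str, total_cost: float,
                 max_cost: float = 1.0, alpha: float = 0.5) -> float:
    """Combine the three terms; alpha (hypothetical) trades answer quality against cost."""
    return (format_reward(trajectory)
            + (1.0 - alpha) * outcome_reward(trajectory, gold)
            + alpha * cost_reward(total_cost, max_cost))


if __name__ == "__main__":
    rollout = "<think>Need a capital city.</think><answer>Paris</answer>"
    print(total_reward(rollout, "paris", total_cost=0.25))
```

In this sketch, raising `alpha` pushes the router toward cheaper models at the expense of answer quality. A real training setup might also withhold the cost term when the answer is wrong, so that cheap but incorrect rollouts are not rewarded; that choice is left open here.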
