Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Abstract

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to previously unseen models. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
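The abstract names three rule-based reward terms (format, final outcome, cost) but does not give their formulas. The following is a minimal sketch of one way such terms could be combined, not the authors' definition. Everything in it is an assumption made for illustration: the `Rollout` fields, the exact-match outcome check, the linear cost penalty against a hypothetical `budget_usd`, and the weighting parameter `alpha`.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One finished routing episode. All field names are hypothetical."""
    well_formed: bool  # output followed the expected think/route/answer format
    answer: str        # final answer extracted from the rollout
    cost_usd: float    # summed price of every routed model call


def format_reward(r: Rollout) -> float:
    # Assumed rule: penalize malformed outputs, otherwise neutral.
    return 0.0 if r.well_formed else -1.0


def outcome_reward(r: Rollout, gold: str) -> float:
    # Assumed rule: case-insensitive exact match against the gold answer.
    return 1.0 if r.answer.strip().lower() == gold.strip().lower() else 0.0


def cost_reward(r: Rollout, budget_usd: float) -> float:
    # Assumed rule: 1.0 when nothing is spent, falling linearly to 0.0
    # once spending reaches the budget.
    return max(0.0, 1.0 - r.cost_usd / budget_usd)


def total_reward(r: Rollout, gold: str, alpha: float = 0.5,
                 budget_usd: float = 0.01) -> float:
    # alpha trades answer quality against cost: alpha=0 ignores cost,
    # alpha=1 ignores correctness.
    return (format_reward(r)
            + (1.0 - alpha) * outcome_reward(r, gold)
            + alpha * cost_reward(r, budget_usd))
```

Under these assumptions, a correct, well-formed rollout that spends half the budget scores lower than one that answers correctly at no cost, which is the kind of performance-cost pressure the abstract describes. Raising `alpha` makes the policy favor cheaper routes.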
