Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

June 10, 2025
Authors: Haozhen Zhang, Tao Feng, Jiaxuan You
cs.AI

Abstract

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their ability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing the performance-cost trade-off via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
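The rule-based reward described in the abstract (a format term, a final-outcome term, and a cost term) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the `<think>`/`<route>`/`<answer>` tag syntax, the function names, the exact-match outcome check, the linear cost normalization, the `alpha` weight, and the rule that a malformed trajectory receives only the format penalty.

```python
import re

# Hypothetical tag syntax: the abstract names "think" and "route" actions
# but does not specify how they are serialized.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>\s*$", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient exact-match check."""
    return " ".join(text.lower().split())


def format_reward(trajectory: str) -> float:
    """Return 0.0 for a well-formed trajectory, -1.0 otherwise (assumed values)."""
    has_think = "<think>" in trajectory and "</think>" in trajectory
    has_answer = ANSWER_RE.search(trajectory) is not None
    return 0.0 if (has_think and has_answer) else -1.0


def outcome_reward(trajectory: str, gold: str) -> float:
    """Return 1.0 if the final answer matches the gold answer, else 0.0."""
    match = ANSWER_RE.search(trajectory)
    if match is None:
        return 0.0
    return 1.0 if normalize(match.group(1)) == normalize(gold) else 0.0


def cost_reward(cost_usd: float, max_cost_usd: float) -> float:
    """Map the accumulated routing cost to [0, 1]; cheaper is better."""
    if max_cost_usd <= 0:
        raise ValueError("max_cost_usd must be positive")
    return 1.0 - min(max(cost_usd, 0.0), max_cost_usd) / max_cost_usd


def total_reward(trajectory: str, gold: str, cost_usd: float,
                 max_cost_usd: float = 0.01, alpha: float = 0.5) -> float:
    """Combine the three terms; alpha trades accuracy against cost (assumed form)."""
    fmt = format_reward(trajectory)
    if fmt < 0:
        # Assumption: a malformed trajectory earns only the format penalty.
        return fmt
    return ((1.0 - alpha) * outcome_reward(trajectory, gold)
            + alpha * cost_reward(cost_usd, max_cost_usd))


if __name__ == "__main__":
    traj = ("<think>I need an external fact.</think>"
            "<route>model-a: What is the capital of France?</route>"
            "<answer>Paris</answer>")
    print(total_reward(traj, "paris", cost_usd=0.002))
```

In this sketch, raising `alpha` pushes the policy toward cheaper candidate models, while lowering it favors answer accuracy; the paper's actual weighting scheme may differ.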