Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

June 10, 2025
Authors: Haozhen Zhang, Tao Feng, Jiaxuan You
cs.AI

Abstract

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their ability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
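
To make the sequential decision process in the abstract concrete, the following is a minimal sketch of a multi-round routing episode and a rule-based reward. It is not the authors' implementation: the tag syntax (`<think>`, `<route model="...">`, `<info>`, `<answer>`), the `Candidate` fields, the whitespace token count, the cost-reward shape `1 / (1 + cost)`, and the weight `alpha` are all assumptions chosen for illustration, and the scripted policy stands in for a trained router LLM.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Candidate:
    """A routable model known only by simple descriptors (hypothetical fields)."""
    name: str
    price_per_1k_tokens: float      # illustrative pricing descriptor
    call: Callable[[str], str]      # stand-in for a real model API call


# Hypothetical action syntax; the real prompt format may differ.
ROUTE_RE = re.compile(r'<route model="([^"]+)">(.*?)</route>', re.S)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.S)


def run_episode(policy: Callable[[str], str], question: str,
                pool: Dict[str, Candidate],
                max_rounds: int = 4) -> Tuple[str, Optional[str], float]:
    """Interleave policy steps with model calls until an answer appears."""
    context, cost = question, 0.0
    for _ in range(max_rounds):
        step = policy(context)
        context += "\n" + step
        answer = ANSWER_RE.search(step)
        if answer:
            return context, answer.group(1).strip(), cost
        route = ROUTE_RE.search(step)
        if route and route.group(1) in pool:
            model = pool[route.group(1)]
            reply = model.call(route.group(2).strip())
            # Crude whitespace token count, used only to accumulate cost.
            cost += model.price_per_1k_tokens * len(reply.split()) / 1000
            # Fold the response back into the evolving context.
            context += f"\n<info>{reply}</info>"
    return context, None, cost


def reward(trajectory: str, prediction: Optional[str], gold: str,
           cost: float, alpha: float = 0.5) -> float:
    """Rule-based reward: format gate, then outcome and cost terms (assumed forms)."""
    if "<think>" not in trajectory or prediction is None:
        return -1.0                                  # format penalty
    outcome = 1.0 if prediction.lower() == gold.lower() else 0.0
    cost_term = 1.0 / (1.0 + cost)                   # cheaper episodes score higher
    return (1 - alpha) * outcome + alpha * cost_term


def scripted_policy(context: str) -> str:
    """Toy stand-in for the router LLM: route once, then answer."""
    if "<info>" not in context:
        return ('<think>I need a fact.</think>'
                '<route model="small">Capital of France?</route>')
    return "<think>The reply suffices.</think><answer>Paris</answer>"


pool = {"small": Candidate("small", 0.1, lambda q: "Paris")}
trajectory, prediction, cost = run_episode(
    scripted_policy, "What is the capital of France?", pool)
score = reward(trajectory, prediction, "paris", cost)
```

In an RL setup, a scalar such as `score` would be the episode-level signal used to update the router policy; raising `alpha` in this sketch shifts the trade-off toward cheaper routing decisions.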