Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

Abstract

The rapid proliferation of large language models (LLMs) that differ in capability, cost, and domain has created a critical need for intelligent model selection at inference time. Smaller models suffice for routine queries, whereas complex tasks demand more capable models. Static model deployment, however, does not account for the complexity and domain of incoming queries, which leads to suboptimal performance and increased cost. Dynamic routing systems that adaptively select a model based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine the key trade-offs. Beyond this taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when routing decisions are made, what information they use, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis shows that effective multi-LLM routing requires balancing competing objectives, and that the optimal routing strategy depends on the deployment setting and its computational constraints. By strategically leveraging the specialized capabilities of different models, well-designed routing systems can outperform even the most powerful individual model while maximizing efficiency gains. Open challenges remain, including the development of routing mechanisms that generalize across diverse architectures, modalities, and applications.