
Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

February 23, 2026
作者: Yasmin Moslem, John D. Kelleher
cs.AI

Abstract

The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis demonstrates that effective multi-LLM routing requires balancing competing objectives. Choosing the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
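The cascading paradigm described above — answer with a cheap model and escalate to a more capable one when the query appears too hard — can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding for exposition, not an implementation from the survey: the `Model` record, the per-query `cost` figures, and the length-based confidence heuristic of the toy small model are all invented assumptions standing in for real LLM calls and real confidence estimators.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Model:
    """Hypothetical wrapper: a named model with a per-query cost and a
    callable that returns (answer, confidence) for a query."""
    name: str
    cost: float
    answer: Callable[[str], Tuple[str, float]]

def cascade(query: str, models: List[Model], threshold: float = 0.8):
    """Try models cheapest-first; accept the first answer whose confidence
    meets the threshold, otherwise fall back to the last (strongest) model."""
    total_cost = 0.0
    for model in models:
        answer, confidence = model.answer(query)
        total_cost += model.cost
        if confidence >= threshold:
            return answer, model.name, total_cost
    # No model was confident enough: keep the most capable model's answer.
    return answer, model.name, total_cost

# Toy stand-ins: a small model that is only confident on short queries,
# and a large model that is confident on everything.
small = Model("small-llm", cost=1.0,
              answer=lambda q: ("small-answer", 0.9 if len(q) < 20 else 0.3))
large = Model("large-llm", cost=10.0,
              answer=lambda q: ("large-answer", 0.95))

print(cascade("short query", [small, large]))                     # stays small
print(cascade("a much longer and harder query", [small, large]))  # escalates
```

The sketch makes the survey's central trade-off concrete: easy queries exit early at low cost, hard queries pay for both models, so the threshold directly trades accuracy against total inference cost. A router (as opposed to a cascade) would instead predict the target model up front from query features, avoiding the wasted small-model call on hard queries at the price of training a selection model.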