Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

Abstract

The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domain specializations has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. Static model deployment, however, cannot account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to Mixture of Experts architectures, which route within a single model, we focus on routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond the taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how decisions are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis shows that effective multi-LLM routing requires balancing competing objectives, and that the optimal routing strategy depends on the deployment environment and computational constraints. Well-designed routing systems can outperform even the most powerful individual model by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
