Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

Abstract

The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts (MoE) architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis shows that effective multi-LLM routing requires balancing competing objectives, and that the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
