Clustering, Routing, and Escalation: A Cascade Framework for Cost-Aware LLM Serving

Abstract

Deploying large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model, which is either expensive for easy queries or insufficient for hard ones. To address this, we propose a two-stage cascade. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model; the cost budget for this routing step is set by an interpretable hyperparameter that is tuned offline. Stage 2 adds a quality estimation (QE) cascade: when a Stage 1 output is judged low-quality, the query is escalated to a stronger model, so that only hard or low-confidence cases reach the expensive models. On the test datasets, the cascaded system retains 97-99% of the strongest model's accuracy while reducing time per output token (TPOT). The method requires only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
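The two stages described above can be illustrated with a minimal sketch. It is not the paper's implementation: the nearest-centroid cluster assignment, the per-cluster score `accuracy - lam * cost`, the hyperparameter name `lam`, escalation directly to the strongest model, and all model names and numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ModelStats:
    accuracy: float  # offline accuracy on one cluster, from task-correctness labels
    cost: float      # relative cost per query (hypothetical unit)


def nearest_cluster(embedding: Sequence[float],
                    centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to the query embedding."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(embedding, centroids[i]))


def build_routing_table(stats: List[Dict[str, ModelStats]], lam: float) -> List[str]:
    """Stage 1 (offline): pick one model per cluster.

    Assumed objective: maximise accuracy - lam * cost, where a larger `lam`
    pushes the router toward cheaper models.
    """
    table = []
    for per_model in stats:
        best = max(per_model,
                   key=lambda m: per_model[m].accuracy - lam * per_model[m].cost)
        table.append(best)
    return table


def cascade(embedding: Sequence[float],
            centroids: Sequence[Sequence[float]],
            table: List[str],
            generate: Callable[[str], str],
            estimate_quality: Callable[[str], float],
            strongest: str,
            qe_threshold: float) -> Tuple[str, str]:
    """Route by cluster, then escalate if the QE score falls below the threshold."""
    model = table[nearest_cluster(embedding, centroids)]
    output = generate(model)
    # Stage 2: escalate only when the first answer looks low-quality.
    if model != strongest and estimate_quality(output) < qe_threshold:
        model = strongest
        output = generate(model)
    return model, output


# Hypothetical offline statistics for two clusters and two models.
centroids = [(0.0, 0.0), (1.0, 1.0)]
stats = [
    {"small": ModelStats(0.90, 1.0), "large": ModelStats(0.92, 10.0)},
    {"small": ModelStats(0.40, 1.0), "large": ModelStats(0.85, 10.0)},
]
table = build_routing_table(stats, lam=0.01)

# Stand-ins for a real model call and a real quality estimator.
fake_generate = lambda model: f"{model}-answer"
fake_qe = lambda output: 0.2 if output.startswith("small") else 0.9

result = cascade((0.1, 0.2), centroids, table, fake_generate, fake_qe,
                 strongest="large", qe_threshold=0.5)
```

In this toy setup, raising `lam` makes the cheaper model win more clusters, and lowering `qe_threshold` reduces how often Stage 2 escalates. A real system would replace the stand-in callables with actual model inference and a learned quality estimator.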