Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs

Abstract

Efficient use of large language models (LLMs) is critical for deployment at scale: without adaptive routing, systems either overpay for strong models or risk poor performance from weaker ones. Selecting the right LLM for each query is fundamentally an online decision problem: models differ in strengths, prices fluctuate, and users value accuracy and cost differently. Yet most routers are trained offline with labels for all candidate models, an assumption that breaks in deployment, where only the outcome of the chosen model is observed. We bridge this gap with BaRP, a Bandit-feedback Routing with Preferences approach that trains under the same partial-feedback restriction as deployment while supporting preference-tunable inference: operators can dial the performance/cost trade-off at test time without retraining. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt, rather than depending on full-information offline supervision. Comprehensive experiments show that our method consistently outperforms strong offline routers by at least 12.46% and the largest LLM by at least 2.45%, and generalizes robustly to unseen tasks.
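The contextual-bandit framing in the abstract can be illustrated with a small sketch. This is not the BaRP implementation: it assumes a linear softmax policy trained with a plain REINFORCE update, a two-dimensional preference vector `(w_perf, w_cost)`, and a reward defined as weighted quality minus weighted cost. The two candidate models, their quality and cost numbers, and all names (`MODELS`, `PreferenceConditionedRouter`, `scalarized_reward`, `train_toy_router`) are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical toy setup (not from the paper): two candidate models with
# made-up expected quality on easy (0) / hard (1) prompts and made-up costs.
MODELS = [
    {"name": "small", "quality": {0: 0.90, 1: 0.20}, "cost": 0.1},
    {"name": "large", "quality": {0: 0.95, 1: 0.95}, "cost": 1.0},
]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scalarized_reward(model, difficulty, preference):
    """Assumed reward: weighted expected quality minus weighted cost."""
    w_perf, w_cost = preference
    return w_perf * model["quality"][difficulty] - w_cost * model["cost"]


class PreferenceConditionedRouter:
    """Linear softmax policy over [prompt features, preference vector, bias]."""

    def __init__(self, n_features, n_models, lr=0.1, seed=0):
        self.rng = random.Random(seed)
        self.lr = lr
        self.dim = n_features + 2 + 1  # features + (w_perf, w_cost) + bias
        self.weights = [[0.0] * self.dim for _ in range(n_models)]

    def _context(self, features, preference):
        return list(features) + list(preference) + [1.0]

    def probabilities(self, features, preference):
        ctx = self._context(features, preference)
        logits = [sum(w * c for w, c in zip(row, ctx)) for row in self.weights]
        return softmax(logits)

    def choose(self, features, preference):
        probs = self.probabilities(features, preference)
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, features, preference, action, reward, baseline=0.0):
        """One REINFORCE step; only the chosen model's reward is used."""
        ctx = self._context(features, preference)
        probs = self.probabilities(features, preference)
        advantage = reward - baseline
        for k, row in enumerate(self.weights):
            grad_logit = (1.0 if k == action else 0.0) - probs[k]
            for j in range(self.dim):
                row[j] += self.lr * advantage * grad_logit * ctx[j]


def train_toy_router(steps=20000, seed=0):
    """Simulate online bandit feedback: sample a prompt and a preference,
    route it, observe the reward of the chosen model only, then update."""
    rng = random.Random(seed)
    router = PreferenceConditionedRouter(n_features=1, n_models=len(MODELS), seed=seed)
    baseline = 0.0
    for _ in range(steps):
        difficulty = rng.randint(0, 1)
        preference = rng.choice([(1.0, 0.0), (0.5, 0.5)])
        action = router.choose([difficulty], preference)
        reward = scalarized_reward(MODELS[action], difficulty, preference)
        router.update([difficulty], preference, action, reward, baseline)
        baseline += 0.01 * (reward - baseline)  # running-mean baseline
    return router


if __name__ == "__main__":
    router = train_toy_router()
    for difficulty in (0, 1):
        for preference in [(1.0, 0.0), (0.5, 0.5)]:
            p_large = router.probabilities([difficulty], preference)[1]
            print(f"difficulty={difficulty} preference={preference} P(large)={p_large:.2f}")
```

Because the preference vector is part of the policy input, the same trained router can be queried with a different `(w_perf, w_cost)` at test time without retraining, which mirrors the preference-tunable inference described above. In this toy setup the router should tend to favor the large model for hard prompts under a performance-only preference and the small model when cost is weighted heavily.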
