Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs
October 8, 2025
Authors: Wang Wei, Tiankai Yang, Hongjie Chen, Yue Zhao, Franck Dernoncourt, Ryan A. Rossi, Hoda Eldardiry
cs.AI
Abstract
Efficient use of large language models (LLMs) is critical for deployment at
scale: without adaptive routing, systems either overpay for strong models or
risk poor performance from weaker ones. Selecting the right LLM for each query
is fundamentally an online decision problem: models differ in strengths, prices
fluctuate, and users value accuracy and cost differently. Yet most routers are
trained offline with labels for all candidate models, an assumption that breaks
in deployment, where only the outcome of the chosen model is observed. We
bridge this gap with BaRP, a Bandit-feedback Routing with Preferences approach
that trains under the same partial-feedback restriction as deployment, while
supporting preference-tunable inference: operators can dial the
performance/cost trade-off at test time without retraining. Framed as a
contextual bandit over prompt features and a user preference vector, our method
simulates an online feedback setting during training and adapts its routing
decisions to each new prompt, rather than depending on full-information offline
supervision. Comprehensive experiments show that our method consistently
outperforms strong offline routers by at least 12.46% and the largest LLM by at
least 2.45%, and generalizes robustly to unseen tasks.
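
To make the contextual-bandit framing concrete, the sketch below shows a router that conditions on prompt features plus a performance/cost preference vector, learns from partial (chosen-model-only) feedback, and lets the operator dial the trade-off at test time by changing the preference input. This is a minimal illustration under assumptions: `EpsilonGreedyRouter`, `scalarized_reward`, and all dimensions and hyperparameters are hypothetical stand-ins, not the paper's BaRP implementation.

```python
import numpy as np

class EpsilonGreedyRouter:
    """Per-LLM linear reward estimators trained from bandit feedback:
    only the chosen model's observed reward is used for updates."""

    def __init__(self, n_models, ctx_dim, epsilon=0.1, lr=0.05):
        self.n_models = n_models
        self.epsilon = epsilon
        self.lr = lr
        # One linear reward estimator per candidate LLM.
        self.weights = np.zeros((n_models, ctx_dim))

    def _context(self, prompt_features, preference):
        # Context = prompt features concatenated with the user's
        # performance/cost preference vector, so one policy can serve
        # many trade-offs without retraining.
        return np.concatenate([prompt_features, preference])

    def select(self, prompt_features, preference):
        x = self._context(prompt_features, preference)
        if np.random.rand() < self.epsilon:           # explore
            return np.random.randint(self.n_models), x
        return int(np.argmax(self.weights @ x)), x    # exploit

    def update(self, chosen, x, reward):
        # Partial feedback: only the selected model's estimator
        # takes a gradient step toward the observed reward.
        pred = self.weights[chosen] @ x
        self.weights[chosen] += self.lr * (reward - pred) * x


def scalarized_reward(accuracy, cost, preference):
    # Hypothetical scalarization: preference = (w_perf, w_cost)
    # weights the accuracy/cost trade-off dialed at test time.
    w_perf, w_cost = preference
    return w_perf * accuracy - w_cost * cost


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    router = EpsilonGreedyRouter(n_models=3, ctx_dim=8 + 2)
    for _ in range(1000):
        prompt = rng.normal(size=8)           # stand-in prompt features
        pref = rng.dirichlet([1.0, 1.0])      # random performance/cost preference
        arm, x = router.select(prompt, pref)
        # Simulated outcome of the chosen model only (bandit feedback).
        acc = rng.uniform(0.5, 1.0) if arm == 2 else rng.uniform(0.2, 0.8)
        cost = [0.1, 0.3, 1.0][arm]
        router.update(arm, x, scalarized_reward(acc, cost, pref))
```

Because the preference vector is part of the context rather than baked into the reward model's training labels, shifting the operator's weighting at inference time simply changes the input to the learned policy, mirroring the preference-tunable inference described in the abstract.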