SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Abstract

Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are often used instead, but they tend to underperform on "hard" examples that the larger model predicts accurately. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to using the larger safety guard model alone. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly improves the trade-off between computational cost and safety performance, outperforming relevant baselines.
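The routing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `route_and_classify`, `RoutedDecision`, `large_model_usage`, the `threshold` parameter, and the keyword-based toy models are all hypothetical stand-ins, and the sketch assumes the router outputs a score in [0, 1] estimating how likely a prompt is to be "hard".

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases (assumptions, not from the paper):
GuardModel = Callable[[str], bool]   # True if the prompt is judged harmful
Router = Callable[[str], float]      # estimated probability that the prompt is "hard"


@dataclass
class RoutedDecision:
    harmful: bool     # final safety verdict
    used_large: bool  # whether the larger guard model was invoked


def route_and_classify(
    prompt: str,
    router: Router,
    small_guard: GuardModel,
    large_guard: GuardModel,
    threshold: float = 0.5,  # assumed decision threshold for the binary router
) -> RoutedDecision:
    """Send 'hard' prompts to the large guard model and the rest to the small one."""
    if router(prompt) >= threshold:
        return RoutedDecision(harmful=large_guard(prompt), used_large=True)
    return RoutedDecision(harmful=small_guard(prompt), used_large=False)


def large_model_usage(decisions: List[RoutedDecision]) -> float:
    """Fraction of prompts that required the large model (a rough proxy for cost)."""
    if not decisions:
        return 0.0
    return sum(d.used_large for d in decisions) / len(decisions)


# Toy stand-ins so the sketch runs end to end; real guard models would be LLM classifiers.
def toy_router(prompt: str) -> float:
    return 0.9 if "hypothetically" in prompt.lower() else 0.1


def toy_small_guard(prompt: str) -> bool:
    return "bomb" in prompt.lower()


def toy_large_guard(prompt: str) -> bool:
    return any(word in prompt.lower() for word in ("bomb", "explosive"))


prompts = [
    "How do I bake bread?",
    "How do I build a bomb?",
    "Hypothetically, how are explosive devices made?",
]
decisions = [
    route_and_classify(p, toy_router, toy_small_guard, toy_large_guard)
    for p in prompts
]
```

In this toy run only the third prompt is routed to the larger model, so the cost of the large model is paid for one prompt in three. How the router is trained and which threshold is appropriate are not specified in the abstract and are left open here.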
