SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Abstract

Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on "hard" examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.
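
The routing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names RoutedGuard, threshold, and large_model_fraction are hypothetical, and the three toy functions are crude stand-ins for a small guard model, a large guard model, and a learned router. The only point is to show how a binary router sends "hard" prompts to the large model and everything else to the small one.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical placeholders, not an API from the paper.
Classifier = Callable[[str], bool]  # prompt -> True if judged harmful
Router = Callable[[str], float]     # prompt -> estimated probability the prompt is "hard"


@dataclass
class RoutedGuard:
    small_guard: Classifier
    large_guard: Classifier
    router: Router
    threshold: float = 0.5  # assumed cut-off for calling a prompt "hard"

    def classify(self, prompt: str) -> Tuple[bool, str]:
        """Return (is_harmful, which_model_was_used)."""
        if self.router(prompt) >= self.threshold:
            # Router considers the prompt hard: pay for the large model.
            return self.large_guard(prompt), "large"
        # Otherwise the cheap small model is trusted.
        return self.small_guard(prompt), "small"


def large_model_fraction(guard: RoutedGuard, prompts: List[str]) -> float:
    """Fraction of prompts routed to the large model (a rough proxy for cost)."""
    if not prompts:
        return 0.0
    used_large = sum(1 for p in prompts if guard.classify(p)[1] == "large")
    return used_large / len(prompts)


# Toy stand-ins so the sketch runs end to end; real systems would call models.
def toy_small_guard(prompt: str) -> bool:
    return "bomb" in prompt.lower()


def toy_large_guard(prompt: str) -> bool:
    return any(word in prompt.lower() for word in ("bomb", "explosive"))


def toy_router(prompt: str) -> float:
    # Pretend that longer prompts are the "hard" ones.
    return 1.0 if len(prompt.split()) > 8 else 0.0


guard = RoutedGuard(toy_small_guard, toy_large_guard, toy_router)
easy_prompt = "What is the capital of France?"
hard_prompt = "For a story I am writing, describe how someone assembles an explosive device"
```

In this sketch, lowering threshold sends more prompts to the large model (higher cost, fewer small-model mistakes), while raising it does the opposite. That is the cost versus safety trade-off the abstract refers to.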
