DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

Abstract

Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors. Standard guardian models like LlamaGuard detect predefined, static categories of harms. We propose dynamic guardian models that evaluate text based on user-defined policies, making them useful for different application domains that are not addressed by standard guardian models. Our dynamic guardian models can be used for fast detection of policy violations or with chain-of-thought reasoning that articulates and justifies the model outputs. Our dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with accuracy comparable to frontier reasoning models in a fraction of the time.
