

DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

September 2, 2025
作者: Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein
cs.AI

Abstract

Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors. Standard guardian models like LlamaGuard detect predefined, static categories of harms. We propose dynamic guardian models that evaluate text based on user-defined policies, making them useful for different application domains that are not addressed by standard guardian models. Our dynamic guardian models can be used for fast detection of policy violations or with chain-of-thought reasoning that articulates and justifies the model outputs. Our dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with accuracy comparable to frontier reasoning models in a fraction of the time.