DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

Abstract

Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors. Standard guardian models like LlamaGuard detect predefined, static categories of harms. We propose dynamic guardian models that evaluate text based on user-defined policies, making them useful for different application domains that are not addressed by standard guardian models. Our dynamic guardian models can be used for fast detection of policy violations or with chain-of-thought reasoning that articulates and justifies the model outputs. Our dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with accuracy comparable to frontier reasoning models in a fraction of the time.
