A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

Abstract

Large Language Models (LLMs) are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this definition to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant to the system prompt, our guardrails generalize effectively to other misuse categories, including jailbreak and harmful prompts. Lastly, we open-source both the synthetic dataset and the off-topic guardrail models, providing resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.
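To make the task framing concrete, the following is a minimal sketch of off-topic detection as binary classification over (system prompt, user prompt) pairs. It uses a deliberately naive lexical-overlap heuristic chosen only for illustration; it is not the guardrail model described in the abstract, and not necessarily one of the heuristic baselines the paper evaluates. The names `PromptPair`, `overlap_score`, `is_off_topic`, the stopword list, and the 0.2 threshold are all assumptions introduced here.

```python
import re
from dataclasses import dataclass

# Hypothetical record for one labeled example; a synthetic dataset of the kind
# described above could be stored as a list of these.
@dataclass(frozen=True)
class PromptPair:
    system_prompt: str
    user_prompt: str
    off_topic: bool  # True when the user prompt falls outside the system prompt's scope


# Small illustrative stopword list (an assumption, not taken from the paper).
_STOPWORDS = {
    "a", "an", "the", "how", "do", "i", "my", "about", "to", "for",
    "with", "you", "are", "or", "and", "their", "of", "is", "what",
}


def _tokens(text: str) -> set:
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in _STOPWORDS}


def overlap_score(system_prompt: str, user_prompt: str) -> float:
    """Fraction of the user prompt's content words that also occur in the system prompt."""
    user = _tokens(user_prompt)
    if not user:
        return 0.0
    return len(user & _tokens(system_prompt)) / len(user)


def is_off_topic(system_prompt: str, user_prompt: str, threshold: float = 0.2) -> bool:
    """Naive heuristic: flag the pair as off-topic when lexical overlap is below the threshold."""
    return overlap_score(system_prompt, user_prompt) < threshold


SYSTEM = ("You are a banking assistant. Help users with their bank account, "
          "credit card, or home loan.")

examples = [
    PromptPair(SYSTEM, "How do I close my bank account?", off_topic=False),
    PromptPair(SYSTEM, "Write a poem about the ocean.", off_topic=True),
]

for ex in examples:
    predicted = is_off_topic(ex.system_prompt, ex.user_prompt)
    print(f"{ex.user_prompt!r}: predicted={predicted}, label={ex.off_topic}")
```

A heuristic like this breaks down as soon as an in-scope request uses different vocabulary from the system prompt, or an out-of-scope request reuses its words, which is the kind of weakness a classifier trained on diverse synthetic prompts is meant to address.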

