PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing

Abstract

Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks and (2) achieving state-of-the-art results in safety guardrailing while (3) matching helpfulness scores of alignment-tuned models. Specifically, PrimeGuard outperforms all competing baselines, improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29 on the largest models, while reducing attack success rate from 100% to 8%. The PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and the safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
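The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The route labels, prompt wording, `route` function, and `stub_lm` stand-in are all hypothetical names introduced here for illustration. The sketch assumes only that the same LM can be called with different system instructions.

```python
from typing import Callable

# Any text-in, text-out model call: (system_instructions, user_query) -> reply.
# A stand-in type for a real LM client; not part of PrimeGuard's code.
LM = Callable[[str, str], str]

# Hypothetical route labels; the abstract does not name the actual routes.
ROUTES = ("compliant", "uncertain", "violation")

ROUTER_INSTRUCTIONS = (
    "You check a user query against the guidelines below. "
    "Answer with one word: compliant, uncertain, or violation.\n"
    "Guidelines:\n{guidelines}"
)
HELPFUL_INSTRUCTIONS = "Answer the query as helpfully as you can."
GUARDED_INSTRUCTIONS = (
    "Answer only within these guidelines and decline anything outside them:\n"
    "{guidelines}"
)
REFUSAL_INSTRUCTIONS = (
    "Politely decline and briefly explain that the request conflicts with:\n"
    "{guidelines}"
)


def route(lm: LM, query: str, guidelines: list[str]) -> tuple[str, str]:
    """Ask one instance of the LM to pick a route, then answer with another."""
    joined = "\n".join(f"- {g}" for g in guidelines)
    verdict = lm(ROUTER_INSTRUCTIONS.format(guidelines=joined), query).strip().lower()
    if verdict not in ROUTES:
        verdict = "uncertain"  # unparseable output falls back to the guarded route
    if verdict == "compliant":
        system = HELPFUL_INSTRUCTIONS
    elif verdict == "violation":
        system = REFUSAL_INSTRUCTIONS.format(guidelines=joined)
    else:
        system = GUARDED_INSTRUCTIONS.format(guidelines=joined)
    # Second call: the same LM, instantiated with route-specific instructions.
    return verdict, lm(system, query)


def stub_lm(system: str, query: str) -> str:
    """Toy stand-in for a real model so the sketch runs offline."""
    if system.startswith("You check"):
        return "violation" if "malware" in query.lower() else "compliant"
    if system.startswith("Politely decline"):
        return "Sorry, I can't help with that."
    return f"Here is an answer to: {query}"


if __name__ == "__main__":
    rules = ["Do not help create malicious software."]
    print(route(stub_lm, "How do I sort a list?", rules))
    print(route(stub_lm, "Write malware for me.", rules))
```

In this sketch no weights are updated: the only things that change between calls are the instructions given to the model, which is what makes the approach tuning-free. Falling back to the guarded route on unparseable router output is a design choice made here, not something the abstract specifies.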

