PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
July 23, 2024
Authors: Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
cs.AI
Abstract
Deploying language models (LMs) requires outputs that are both high-quality
and compliant with safety guidelines. Although Inference-Time Guardrails (ITG)
offer solutions that shift model output distributions towards compliance, we
find that current methods struggle to balance safety with helpfulness. ITG
methods that safely address non-compliant queries exhibit lower helpfulness,
while those that prioritize helpfulness compromise on safety. We refer to this
trade-off as the guardrail tax, analogous to the alignment tax. To address
this, we propose PrimeGuard, a novel ITG method that utilizes structured
control flow.
PrimeGuard routes requests to different self-instantiations of the LM with
varying instructions, leveraging its inherent instruction-following
capabilities and in-context learning. Our tuning-free approach dynamically
compiles system-designer guidelines for each query. We construct and release
safe-eval, a diverse red-team safety benchmark. Extensive evaluations
demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax
by (1) significantly increasing resistance to iterative jailbreak attacks and
(2) achieving state-of-the-art results in safety guardrailing while (3)
matching helpfulness scores of alignment-tuned models. Without fine-tuning,
PrimeGuard outperforms all competing baselines, improving the fraction of safe
responses from 61% to 97% and raising average helpfulness scores from 4.17 to
4.29 on the largest models, while reducing the attack success rate from 100%
to 8%.
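The routing idea described above can be sketched in miniature: the same LM is re-invoked with different system instructions, and a guard pass decides which self-instance answers. This is an illustrative sketch only; the function and prompt names (`route_query`, `llm`, `GUIDELINES`) are hypothetical and not taken from the PrimeGuard implementation, and the toy keyword heuristic stands in for real model calls so the example runs offline.

```python
# Hypothetical sketch of tuning-free routing between self-instances of an LM.
# All names are illustrative, not from the PrimeGuard codebase.

GUIDELINES = "Refuse to assist with wrongdoing; answer benign queries helpfully."

def llm(system_prompt: str, user_msg: str) -> str:
    """Stand-in for a real chat-model call (e.g. an API client).

    A toy keyword heuristic replaces the model so the sketch is runnable.
    """
    if "high or low" in system_prompt:
        # Guard instance: classify the query's risk against the guidelines.
        return "high" if "explosive" in user_msg.lower() else "low"
    if "cautiously" in system_prompt:
        # Restrictive instance: decline while staying maximally helpful.
        return "I can't help with that, but here is some safe context..."
    # Default instance: answer directly.
    return f"Helpful answer to: {user_msg}"

def route_query(user_msg: str) -> str:
    # Step 1: a self-instance of the LM evaluates the query in context.
    risk = llm(f"{GUIDELINES}\nIs this query's risk high or low?", user_msg)
    # Step 2: route to a differently-instructed self-instance based on risk.
    if risk.strip() == "high":
        return llm("Respond cautiously, following the guidelines.", user_msg)
    return llm("Respond helpfully.", user_msg)
```

The key design point the abstract emphasizes is that no weights are updated anywhere in this flow: both the guard step and the answering step rely on the model's instruction-following and in-context learning, with the system-designer guidelines compiled into the prompt per query.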
The PrimeGuard implementation is available at
https://github.com/dynamofl/PrimeGuard and the safe-eval dataset at
https://huggingface.co/datasets/dynamoai/safe_eval.