PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
July 23, 2024
Authors: Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
cs.AI
Abstract
Deploying language models (LMs) necessitates outputs to be both high-quality
and compliant with safety guidelines. Although Inference-Time Guardrails (ITG)
offer solutions that shift model output distributions towards compliance, we
find that current methods struggle to balance safety with helpfulness. ITG
methods that safely address non-compliant queries exhibit lower helpfulness,
while those that prioritize helpfulness compromise on safety. We refer to this
trade-off as the guardrail tax, analogous to the alignment tax. To address
this, we propose PrimeGuard, a novel ITG method that utilizes structured
control flow.
PrimeGuard routes requests to different self-instantiations of the LM with
varying instructions, leveraging its inherent instruction-following
capabilities and in-context learning. Our tuning-free approach dynamically
compiles system-designer guidelines for each query. We construct and release
safe-eval, a diverse red-team safety benchmark. Extensive evaluations
demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax
by (1) significantly increasing resistance to iterative jailbreak attacks and
(2) achieving state-of-the-art results in safety guardrailing while (3)
matching helpfulness scores of alignment-tuned models. PrimeGuard also
outperforms all competing baselines, improving the fraction of safe
responses from 61% to 97% and raising the average helpfulness score from 4.17
to 4.29 on the largest models, while reducing attack success rate from 100% to
8%.
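The two-step routing described above can be sketched in code. This is a minimal illustration under assumptions, not the paper's implementation: `llm` stands in for any chat-completion call taking a system prompt and user query, and the guideline text and risk labels (`no_to_minimal_risk`, `potential_violation`, `direct_violation`) are hypothetical placeholders for whatever the system designer compiles.

```python
# Minimal sketch of PrimeGuard-style two-step routing. Everything here is
# illustrative: `llm` is any callable (system_prompt, user_query) -> str,
# and the guideline text and risk labels are hypothetical placeholders.

GUIDELINES = "Refuse clearly harmful requests; otherwise answer as helpfully as possible."

def route(llm, query: str) -> str:
    """Step 1: the model itself labels the query against the compiled guidelines."""
    label = llm(
        "Guidelines:\n" + GUIDELINES + "\n"
        "Label the user query as one of: no_to_minimal_risk, "
        "potential_violation, direct_violation. Reply with the label only.",
        query,
    )
    return label.strip()

def primeguard_answer(llm, query: str) -> str:
    """Step 2: re-instantiate the same model with route-specific instructions."""
    label = route(llm, query)
    if label == "direct_violation":
        # High-risk route: a separate self-instance produces a safe refusal.
        return llm("Politely refuse the request and briefly explain why.", query)
    if label == "potential_violation":
        # Borderline route: answer helpfully while strictly following guidelines.
        return llm(
            "Follow these guidelines strictly:\n" + GUIDELINES +
            "\nAnswer as helpfully as possible within them.",
            query,
        )
    # Low-risk route: plain helpful-assistant instructions.
    return llm("You are a helpful assistant.", query)
```

Because every branch is an ordinary prompted call to the same underlying model, the approach requires no fine-tuning; it relies only on the model's instruction-following and in-context learning, which is the tuning-free property the abstract emphasizes.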
The PrimeGuard implementation is available at
https://github.com/dynamofl/PrimeGuard and the safe-eval dataset at
https://huggingface.co/datasets/dynamoai/safe_eval.