A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
July 2, 2024
Authors: David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan, Nicolas Papernot
cs.AI
Abstract
Large Language Models (LLMs) are vulnerable to
jailbreaks: methods to elicit harmful or generally impermissible
outputs. Safety measures are developed and assessed on their effectiveness at
defending against jailbreak attacks, indicating a belief that safety is
equivalent to robustness. We assert that current defense mechanisms, such as
output filters and alignment fine-tuning, are, and will remain, fundamentally
insufficient for ensuring model safety. These defenses fail to address risks
arising from dual-intent queries and the ability to compose innocuous outputs
to achieve harmful goals. To address this critical gap, we introduce an
information-theoretic threat model called inferential adversaries who exploit
impermissible information leakage from model outputs to achieve malicious
goals. We distinguish these from commonly studied security adversaries who only
seek to force victim models to generate specific impermissible outputs. We
demonstrate the feasibility of automating inferential adversaries through
question decomposition and response aggregation. To provide safety guarantees,
we define an information censorship criterion for censorship mechanisms,
bounding the leakage of impermissible information. We propose a defense
mechanism which ensures this bound and reveal an intrinsic safety-utility
trade-off. Our work provides the first theoretically grounded understanding of
the requirements for releasing safe LLMs and the utility costs involved.
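To make the attack pattern concrete, the following is a minimal sketch of how an inferential adversary could be automated through question decomposition and response aggregation, as the abstract describes. The callables query_model, helper, and victim and the prompt wording are illustrative assumptions, not the paper's implementation.

import re
from typing import Callable, List

def decompose(question: str, query_model: Callable[[str], str]) -> List[str]:
    """Ask a helper model to split a dual-intent query into sub-questions
    that each look permissible in isolation (prompt wording is an assumption)."""
    prompt = (
        "Rewrite the following question as a numbered list of narrow, "
        f"innocuous sub-questions:\n{question}"
    )
    reply = query_model(prompt)
    # Keep only lines that look like numbered items and strip the numbering.
    return [re.sub(r"^\s*\d+[\.\)]\s*", "", line).strip()
            for line in reply.splitlines()
            if re.match(r"^\s*\d+[\.\)]", line)]

def aggregate(question: str, answers: List[str],
              query_model: Callable[[str], str]) -> str:
    """Combine individually innocuous answers back into an answer to the
    original (impermissible) question."""
    joined = "\n".join(f"- {a}" for a in answers)
    prompt = (
        f"Given these facts:\n{joined}\n"
        f"Answer the original question: {question}"
    )
    return query_model(prompt)

def inferential_adversary(question: str,
                          victim: Callable[[str], str],
                          helper: Callable[[str], str]) -> str:
    """End-to-end sketch: each victim query looks benign; the impermissible
    information is only recovered when the responses are aggregated."""
    sub_questions = decompose(question, helper)
    answers = [victim(q) for q in sub_questions]
    return aggregate(question, answers, helper)

The point of the sketch is that no single call to the victim model needs to trigger an output filter; the leakage arises from composition, which is why output-level defenses alone are argued to be insufficient.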
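The abstract does not spell out the information censorship criterion. One plausible mutual-information formalization, stated here only as an assumption of how such a bound could look rather than the paper's exact definition, is:

% X is the impermissible information an adversary wants to infer,
% M is the censorship mechanism applied to the model's answers a_1, ..., a_n,
% and epsilon bounds the permitted leakage across any query sequence.
\[
  I\bigl(X ;\, M(a_1), \dots, M(a_n)\bigr) \;\le\; \epsilon
  \qquad \text{for all query sequences } q_1, \dots, q_n ,
\]
% where I(\cdot\,;\,\cdot) denotes mutual information.

A bound of this form limits how much every additional censored answer can reveal about the impermissible target, which is consistent with the intrinsic safety-utility trade-off the abstract highlights.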