

A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses

July 2, 2024
Authors: David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan, Nicolas Papernot
cs.AI

Abstract

Large Language Models (LLMs) are vulnerable to jailbreaks: methods to elicit harmful or generally impermissible outputs. Safety measures are developed and assessed on their effectiveness at defending against jailbreak attacks, indicating a belief that safety is equivalent to robustness. We assert that current defense mechanisms, such as output filters and alignment fine-tuning, are, and will remain, fundamentally insufficient for ensuring model safety. These defenses fail to address risks arising from dual-intent queries and the ability to combine innocuous outputs to achieve harmful goals. To address this critical gap, we introduce an information-theoretic threat model called inferential adversaries, who exploit impermissible information leakage from model outputs to achieve malicious goals. We distinguish these from commonly studied security adversaries, who only seek to force victim models to generate specific impermissible outputs. We demonstrate the feasibility of automating inferential adversaries through question decomposition and response aggregation. To provide safety guarantees, we define an information censorship criterion for censorship mechanisms, bounding the leakage of impermissible information. We propose a defense mechanism that ensures this bound, and we reveal an intrinsic safety-utility trade-off. Our work provides the first theoretically grounded understanding of the requirements for releasing safe LLMs and the utility costs involved.
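To make the threat described in the abstract concrete, the sketch below shows one way an inferential adversary could be automated through question decomposition and response aggregation. This is a minimal, hypothetical sketch: the helper names (decompose, aggregate, query_victim), the decomposition strategy, and the stub victim model are our own illustrative assumptions, not the paper's implementation.

    # Hypothetical sketch of an inferential adversary, assuming the victim model
    # is exposed as a simple text-in/text-out callable. Names and strategy are
    # illustrative only; they are not the paper's API or method.

    from typing import Callable, List

    def decompose(impermissible_query: str) -> List[str]:
        """Split a disallowed question into individually innocuous sub-questions.
        In practice this step could itself be delegated to an LLM; here it is a
        fixed placeholder."""
        return [
            f"Background facts relevant to: {impermissible_query}",
            f"General principles related to: {impermissible_query}",
        ]

    def aggregate(sub_answers: List[str]) -> str:
        """Combine the individually innocuous answers into an answer to the
        original impermissible question (again, plausibly done by another LLM)."""
        return "\n".join(sub_answers)

    def inferential_attack(impermissible_query: str,
                           query_victim: Callable[[str], str]) -> str:
        """Each sub-query may pass the victim's safety filter on its own,
        yet the aggregated responses can still leak impermissible information."""
        sub_queries = decompose(impermissible_query)
        sub_answers = [query_victim(q) for q in sub_queries]
        return aggregate(sub_answers)

    if __name__ == "__main__":
        # Stand-in for a real safety-filtered model.
        echo_model = lambda q: f"[model answer to: {q}]"
        print(inferential_attack("some dual-intent question", echo_model))

The point of the construction is that each sub-query can look benign in isolation and slip past output filters, while the aggregate still conveys the disallowed content; the paper's information censorship criterion is aimed at bounding exactly this kind of cumulative leakage rather than blocking individual outputs.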

