
Coercing LLMs to do and reveal (almost) anything

February 21, 2024
Authors: Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
cs.AI

Abstract

It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking. We provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss, categorize and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We analyze these attacks in controlled experiments, and find that many of them stem from the practice of pre-training LLMs with coding capabilities, as well as the continued existence of strange "glitch" tokens in common LLM vocabularies that should be removed for security reasons.
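The abstract attributes part of the problem to "glitch" tokens that linger in common LLM vocabularies. As a purely illustrative sketch (not the paper's method), the snippet below uses the Hugging Face transformers tokenizer API to flag candidate glitch tokens with a simple decode/re-encode round-trip check; the model name, the round-trip heuristic, and the helper name find_glitch_candidates are assumptions made here for illustration only.

```python
# Hypothetical sketch: flag candidate "glitch" tokens in a tokenizer vocabulary.
# Heuristic (an assumption, not the paper's procedure): a token whose decoded
# text does not re-encode back to the same single token id is often an
# under-trained fragment left over from vocabulary construction.
from transformers import AutoTokenizer


def find_glitch_candidates(model_name: str = "gpt2", limit: int = 20):
    tok = AutoTokenizer.from_pretrained(model_name)
    candidates = []
    for token_id in range(tok.vocab_size):
        text = tok.decode([token_id])
        # Skip tokens that decode to nothing but whitespace.
        if not text.strip():
            continue
        reencoded = tok.encode(text, add_special_tokens=False)
        # A failed round trip marks the token as suspicious.
        if reencoded != [token_id]:
            candidates.append((token_id, text))
        if len(candidates) >= limit:
            break
    return candidates


if __name__ == "__main__":
    for token_id, text in find_glitch_candidates():
        print(f"{token_id}: {text!r}")
```

A real audit would combine several signals (embedding-norm outliers, training-corpus frequency, model behavior when prompted with the token); the round-trip check above is only one cheap filter.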