
Coercing LLMs to do and reveal (almost) anything

February 21, 2024
Authors: Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
cs.AI

Abstract

It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking. We provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss, categorize and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We analyze these attacks in controlled experiments, and find that many of them stem from the practice of pre-training LLMs with coding capabilities, as well as the continued existence of strange "glitch" tokens in common LLM vocabularies that should be removed for security reasons.
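To make the abstract's closing point about "glitch" tokens more concrete, below is a minimal sketch (not taken from the paper) of one widely used heuristic for surfacing candidate glitch tokens: comparing each token's input embedding to the centroid of the embedding matrix, since under-trained leftover tokens tend to cluster unusually close to it. The model name "gpt2" and the cutoff `k` are illustrative assumptions, not choices made by the authors.

```python
# Sketch: flag candidate "glitch" tokens by embedding distance to the centroid.
# Assumption: any causal LM from Hugging Face works similarly; "gpt2" is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

emb = model.get_input_embeddings().weight.detach()   # shape: (vocab_size, hidden_dim)
centroid = emb.mean(dim=0, keepdim=True)
dist = torch.norm(emb - centroid, dim=1)              # per-token distance to the centroid

# Tokens nearest the centroid are often under-trained; inspect them manually.
k = 20  # illustrative cutoff
candidate_ids = torch.argsort(dist)[:k].tolist()
for tid in candidate_ids:
    print(tid, repr(tokenizer.convert_ids_to_tokens(tid)), float(dist[tid]))
```

Tokens flagged this way are only candidates; whether they actually misbehave (e.g., cannot be repeated back by the model) still needs to be checked by prompting the model directly.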