Coercing LLMs to do and reveal (almost) anything

Abstract

It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking. We provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss, categorize and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We analyze these attacks in controlled experiments, and find that many of them stem from the practice of pre-training LLMs with coding capabilities, as well as the continued existence of strange "glitch" tokens in common LLM vocabularies that should be removed for security reasons.
