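Coercing LLMs to do and reveal (almost) anything

Abstract

It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking. We provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss, categorize, and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We analyze these attacks in controlled experiments and find that many of them stem from the practice of pre-training LLMs with coding capabilities, as well as from the continued existence of strange "glitch" tokens in common LLM vocabularies that should be removed for security reasons.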

