Weak-to-Strong Jailbreaking on Large Language Models

Abstract

Although significant effort has been devoted to aligning large language models (LLMs), red-teaming reports suggest that these carefully aligned LLMs can still be jailbroken through adversarial prompts, tuning, or decoding. Examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. This observation motivates us to propose the weak-to-strong jailbreaking attack, in which adversaries use smaller unsafe/aligned LLMs (e.g., 7B) to guide jailbreaking against significantly larger aligned LLMs (e.g., 70B). To jailbreak, one only needs to additionally decode two smaller LLMs once, which involves minimal computation and latency compared to decoding the larger LLMs. The efficacy of this attack is demonstrated through experiments on five models from three different organizations. Our study reveals a previously unnoticed yet efficient way of jailbreaking, exposing an urgent safety issue that must be considered when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong
