Merging Improves Self-Critique Against Jailbreak Attacks

Abstract

The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it on sanitized synthetic data. This is done by adding an external critic model that can be merged with the original model, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, thus offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks .
