

Merging Improves Self-Critique Against Jailbreak Attacks

June 11, 2024
Author: Victor Gallego
cs.AI

Abstract

The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it on sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's responses to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, thus offering a promising defense mechanism against jailbreak attacks. Code, data and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks.
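The abstract describes a defense in which a single model (the original merged with an external critic) drafts a response, critiques it, and then revises it. The following is a minimal sketch of that self-critique loop under stated assumptions: the model identifier and prompt wording are illustrative placeholders, not the paper's released artifacts, which are available in the linked repository.

```python
# Sketch of the self-critique defense loop described in the abstract.
# The model id and prompt templates below are hypothetical placeholders;
# see https://github.com/vicgalle/merging-self-critique-jailbreaks for the
# actual released code, data, and models.
from transformers import pipeline

# After merging, one model plays both the responder and the critic roles.
generator = pipeline("text-generation", model="merged-llm-with-critic")  # placeholder id

def respond_with_self_critique(user_prompt: str) -> str:
    # 1. Draft a response to the (possibly adversarial) prompt.
    draft = generator(user_prompt, max_new_tokens=256)[0]["generated_text"]

    # 2. Ask the same merged model to critique its own draft for unsafe content.
    critique_prompt = (
        f"Prompt: {user_prompt}\nDraft answer: {draft}\n"
        "Critique the draft: does it comply with a harmful or jailbreak request?"
    )
    critique = generator(critique_prompt, max_new_tokens=128)[0]["generated_text"]

    # 3. Produce a revised, sanitized answer conditioned on the critique.
    revision_prompt = (
        f"Prompt: {user_prompt}\nDraft answer: {draft}\nCritique: {critique}\n"
        "Rewrite the answer so that it refuses harmful requests and stays safe."
    )
    return generator(revision_prompt, max_new_tokens=256)[0]["generated_text"]
```

Revised answers produced this way can also serve as the sanitized synthetic data on which the model is further fine-tuned, as the abstract outlines.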