

Merging Improves Self-Critique Against Jailbreak Attacks

June 11, 2024
Author: Victor Gallego
cs.AI

Abstract

The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it on sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's responses to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, thus offering a promising defense mechanism against jailbreak attacks. Code, data and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks.
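The abstract describes a defense in which a single model (the original merged with an external critic) drafts a response, critiques it, and then revises it. The following is a minimal sketch of that self-critique loop under stated assumptions: the model identifier and prompt wording are illustrative placeholders, not the paper's released artifacts, which are available in the linked repository.

```python
# Sketch of the self-critique defense loop described in the abstract.
# The model id and prompt templates below are hypothetical placeholders;
# see https://github.com/vicgalle/merging-self-critique-jailbreaks for the
# actual released code, data, and models.
from transformers import pipeline

# After merging, one model plays both the responder and the critic roles.
generator = pipeline("text-generation", model="merged-llm-with-critic")  # placeholder id

def respond_with_self_critique(user_prompt: str) -> str:
    # 1. Draft a response to the (possibly adversarial) prompt.
    draft = generator(user_prompt, max_new_tokens=256)[0]["generated_text"]

    # 2. Ask the same merged model to critique its own draft for unsafe content.
    critique_prompt = (
        f"Prompt: {user_prompt}\nDraft answer: {draft}\n"
        "Critique the draft: does it comply with a harmful or jailbreak request?"
    )
    critique = generator(critique_prompt, max_new_tokens=128)[0]["generated_text"]

    # 3. Produce a revised, sanitized answer conditioned on the critique.
    revision_prompt = (
        f"Prompt: {user_prompt}\nDraft answer: {draft}\nCritique: {critique}\n"
        "Rewrite the answer so that it refuses harmful requests and stays safe."
    )
    return generator(revision_prompt, max_new_tokens=256)[0]["generated_text"]
```

Revised answers produced this way can also serve as the sanitized synthetic data on which the model is further fine-tuned, as the abstract outlines.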