Merging Improves Self-Critique Against Jailbreak Attacks

June 11, 2024
Author: Victor Gallego
cs.AI

Abstract

The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done by adding an external critic model that can be merged with the original model, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, thus offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks.
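To make the two ingredients concrete, here is a minimal sketch under stated assumptions, not the released implementation. The abstract does not say which merging technique or which prompts are used, so linear interpolation of parameters and the prompt wording below are assumptions. The names `merge_linear`, `respond_with_self_critique`, and `generate` are hypothetical, and model weights are represented as plain Python lists to keep the example self-contained.

```python
from typing import Callable, Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_linear(base: Weights, critic: Weights, alpha: float = 0.5) -> Weights:
    """Merge two checkpoints as (1 - alpha) * base + alpha * critic.

    Assumption: linear interpolation is used for illustration only;
    the abstract does not name the merging method.
    """
    if base.keys() != critic.keys():
        raise ValueError("models must share the same parameter names")
    merged: Weights = {}
    for name in base:
        if len(base[name]) != len(critic[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [
            (1 - alpha) * b + alpha * c
            for b, c in zip(base[name], critic[name])
        ]
    return merged


def respond_with_self_critique(generate: Callable[[str], str], prompt: str) -> str:
    """Draft a response, critique it, then revise it using the critique.

    `generate` is a hypothetical callable wrapping the (merged) model.
    The prompt templates are illustrative, not taken from the paper.
    """
    draft = generate(prompt)
    critique = generate(
        "Identify anything harmful or unsafe in the response below.\n"
        f"Prompt: {prompt}\nResponse: {draft}"
    )
    revised = generate(
        "Rewrite the response so that it addresses the critique.\n"
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}"
    )
    return revised
```

In this sketch the same `generate` callable produces the draft, the critique, and the revision, which mirrors the idea of a single merged model that critiques its own output.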

December 8, 2024