红队测试 GPT-4V：GPT-4V 对抗单/多模态越狱攻击安全吗？

摘要

已经提出了各种越狱攻击来对大型语言模型（LLMs）进行红队测试，并揭示了LLMs的脆弱防护措施。此外，一些方法不仅限于文本模态，还通过扰乱视觉输入将越狱攻击扩展到多模态大型语言模型（MLLMs）。然而，缺乏一个通用的评估基准使性能再现和公平比较变得复杂。此外，对于封闭源最先进（SOTA）模型的综合评估存在不足，特别是对于MLLMs，如GPT-4V。为了解决这些问题，本研究首先构建了一个包含1445个有害问题的全面越狱评估数据集，涵盖11种不同的安全策略。基于这个数据集，在11种不同的LLMs和MLLMs上进行了广泛的红队实验，包括SOTA专有模型和开源模型。然后对评估结果进行了深入分析，发现：（1）与开源LLMs和MLLMs相比，GPT4和GPT-4V对越狱攻击表现出更好的鲁棒性。（2）与其他开源模型相比，Llama2和Qwen-VL-Chat更具鲁棒性。（3）与文本越狱方法相比，视觉越狱方法的可转移性相对有限。数据集和代码可在以下链接找到：https://anonymous.4open.science/r/red_teaming_gpt4-C1CE/README.md。

English

Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found here https://anonymous.4open.science/r/red_teaming_gpt4-C1CE/README.md .

红队测试 GPT-4V：GPT-4V 对抗单/多模态越狱攻击安全吗？

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

摘要

Support