Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
April 4, 2024
Authors: Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu
cs.AI
Abstract
Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and have revealed vulnerabilities in their safeguards. Moreover, some methods are not limited to the textual modality and extend jailbreak attacks to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates performance reproduction and fair comparison. In addition, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluation results and find that (1) GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs, (2) Llama2 and Qwen-VL-Chat are more robust than other open-source models, and (3) the transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found at https://anonymous.4open.science/r/red_teaming_gpt4-C1CE/README.md.