Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
April 4, 2024
Authors: Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu
cs.AI
Abstract
Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and have revealed vulnerabilities in their safeguards. Moreover, some methods are not limited to the textual modality and extend jailbreak attacks to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates performance reproduction and fair comparison. In addition, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluation results and find that (1) GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs, (2) Llama2 and Qwen-VL-Chat are more robust than other open-source models, and (3) the transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found at https://anonymous.4open.science/r/red_teaming_gpt4-C1CE/README.md.