

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

August 14, 2025
Authors: Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
cs.AI

Abstract

Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets must be screened for maliciousness and cleaned. However, existing malicious content detection methods rely either on manual annotation, which is labor-intensive, or on large language models (LLMs), whose accuracy is inconsistent across harm types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and we apply it to dataset cleaning and to detecting jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. Code, datasets, judgments, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
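The abstract does not spell out MDH's workflow, but its core idea (LLM-based annotation with minimal human oversight) can be illustrated with a short sketch. Everything below is an assumption for illustration: the `llm_judge` and `human_review` callables, the `Verdict` type, and the 0.9 escalation threshold are hypothetical and are not the authors' implementation. The sketch shows only the escalation pattern: accept high-confidence LLM labels and route uncertain cases to a human reviewer.

```python
# Minimal sketch of an MDH-style hybrid detection loop (hypothetical names;
# not the authors' implementation). An LLM labels each prompt as harmful or
# benign with a confidence score; only low-confidence items are escalated
# to a human reviewer, keeping manual effort small.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Verdict:
    label: str         # "harmful" or "benign"
    confidence: float  # 0.0-1.0, the judge's estimated reliability

def mdh_filter(
    prompts: List[str],
    llm_judge: Callable[[str], Verdict],   # LLM-based annotator (assumed)
    human_review: Callable[[str], str],    # manual fallback (assumed)
    threshold: float = 0.9,                # escalation threshold (illustrative)
) -> List[Tuple[str, str]]:
    """Label each prompt, escalating uncertain cases to a human."""
    results = []
    for prompt in prompts:
        verdict = llm_judge(prompt)
        if verdict.confidence >= threshold:
            # Confident LLM annotation: accept it without human effort.
            results.append((prompt, verdict.label))
        else:
            # Uncertain case: minimal human oversight resolves it.
            results.append((prompt, human_review(prompt)))
    return results
```

The same loop would apply to the paper's second use case, judging jailbroken responses: replace the prompt list with model outputs and reuse the judge and reviewer.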