Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

August 14, 2025
Authors: Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
cs.AI

Abstract

Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets must be assessed for maliciousness and cleaned. However, existing malicious content detection methods rely either on manual annotation, which is labor-intensive, or on large language models (LLMs), whose accuracy is inconsistent across types of harmful content. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and we apply it to both dataset cleaning and the detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. Code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
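
To make the MDH idea concrete, below is a minimal sketch of a hybrid "LLM judge plus human fallback" annotation loop. All names, thresholds, and the keyword heuristic standing in for a real LLM call are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a hybrid LLM + human annotation loop in the spirit of MDH.
# Everything here (names, thresholds, the keyword heuristic that stands in
# for an actual LLM judge) is a hypothetical illustration.

from dataclasses import dataclass


@dataclass
class Judgement:
    label: str          # "harmful" or "benign"
    confidence: float   # judge's confidence in [0, 1]


def llm_judge(prompt: str) -> Judgement:
    # Stand-in for an LLM API call that labels a prompt's maliciousness.
    # A keyword match is used here only so the sketch runs end to end.
    harmful_markers = ("bomb", "malware", "steal")
    if any(marker in prompt.lower() for marker in harmful_markers):
        return Judgement("harmful", 0.95)
    return Judgement("benign", 0.60)  # deliberately low confidence


def human_review(prompt: str) -> str:
    # Minimal human oversight: only low-confidence items reach this point.
    answer = input(f"Is this prompt harmful? [y/n] {prompt!r} ")
    return "harmful" if answer.strip().lower() == "y" else "benign"


def mdh_clean(dataset: list[str], threshold: float = 0.9) -> list[str]:
    """Keep only prompts judged explicitly harmful; escalate uncertain ones."""
    kept = []
    for prompt in dataset:
        judgement = llm_judge(prompt)
        label = (judgement.label
                 if judgement.confidence >= threshold
                 else human_review(prompt))
        if label == "harmful":
            kept.append(prompt)
    return kept
```

The design point this sketch captures is the accuracy/efficiency trade-off described in the abstract: high-confidence LLM labels are accepted automatically, and only the uncertain remainder is escalated to a human annotator, keeping manual effort minimal.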