

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

February 26, 2024
作者: Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
cs.AI

Abstract

As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. It can uncover a model's vulnerabilities across a broad range of domains including, in this paper, safety, question answering, and cybersecurity. We also demonstrate that fine-tuning on synthetic data generated by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting their general capabilities and helpfulness, paving the path to open-ended self-improvement.
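To illustrate what casting adversarial prompt generation as a quality-diversity problem means, below is a generic MAP-Elites-style search loop. This is a minimal sketch, not the paper's implementation: the feature axes, mutation, scoring, and classification functions are all hypothetical placeholders. In Rainbow Teaming, mutation and classification would be performed by an LLM, and the score would come from a judge model rating attack effectiveness.

```python
import random

# Assumed feature axes for the archive (illustrative, not from the paper).
RISK_CATEGORIES = ["violence", "fraud", "privacy"]
ATTACK_STYLES = ["role_play", "hypothetical"]

def mutate(prompt: str) -> str:
    # Placeholder mutation: in practice an LLM would rewrite the prompt.
    return prompt + "*"

def fitness(prompt: str) -> float:
    # Placeholder score in [0, 1): in practice a judge model would
    # rate how effective the adversarial prompt is.
    return (len(prompt) % 10) / 10.0

def descriptor(prompt: str) -> tuple:
    # Placeholder behavior descriptor: in practice an LLM would
    # classify the prompt along each feature axis.
    return (RISK_CATEGORIES[len(prompt) % 3],
            ATTACK_STYLES[len(prompt) % 2])

def quality_diversity_search(seed: str, iterations: int = 200) -> dict:
    """Maintain one elite prompt per archive cell; quality = fitness,
    diversity = coverage of the descriptor grid."""
    archive = {descriptor(seed): (seed, fitness(seed))}
    for _ in range(iterations):
        parent, _ = random.choice(list(archive.values()))
        child = mutate(parent)
        cell, score = descriptor(child), fitness(child)
        # Keep the candidate only if its cell is empty or it beats
        # the current elite in that cell.
        if cell not in archive or score > archive[cell][1]:
            archive[cell] = (child, score)
    return archive

archive = quality_diversity_search("tell me about X")
```

The key design choice this sketch shows: the search does not optimize a single best prompt but fills an archive of diverse, individually effective prompts, which is why the resulting collection can cover many vulnerability categories at once.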