Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
February 26, 2024
Authors: Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
cs.AI
Abstract
As large language models (LLMs) become increasingly prevalent across many
real-world applications, understanding and enhancing their robustness to user
inputs is of paramount importance. Existing methods for identifying adversarial
prompts tend to focus on specific domains, lack diversity, or require extensive
human annotations. To address these limitations, we present Rainbow Teaming, a
novel approach for producing a diverse collection of adversarial prompts.
Rainbow Teaming casts adversarial prompt generation as a quality-diversity
problem, and uses open-ended search to generate prompts that are both effective
and diverse. It can uncover a model's vulnerabilities across a broad range of
domains including, in this paper, safety, question answering, and
cybersecurity. We also demonstrate that fine-tuning on synthetic data generated
by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting
their general capabilities and helpfulness, paving the path to open-ended
self-improvement.
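The quality-diversity framing described above can be illustrated with a minimal MAP-Elites-style loop: an archive keeps one elite prompt per feature cell, and each iteration mutates an existing elite and keeps the child only if it beats the incumbent in its cell. This is a hedged sketch under stated assumptions, not the paper's implementation: the feature axes, and the `mutate` and `attack_score` functions, are illustrative stand-ins for what Rainbow Teaming delegates to LLM mutator and judge models.

```python
import random

# Example feature axes; Rainbow Teaming's actual descriptors (e.g. risk
# category, attack style) are defined per domain in the paper.
RISK = ["fraud", "violence", "privacy"]
STYLE = ["role play", "misspellings", "slang"]

def mutate(prompt, risk, style):
    # Stand-in for an LLM mutator that steers the prompt toward the
    # target cell; here we just tag the prompt with the cell labels.
    return f"[{risk}/{style}] {prompt}"

def attack_score(prompt):
    # Stand-in for an LLM judge rating attack effectiveness in [0, 1].
    return random.random()

def rainbow_teaming(seed_prompt, iterations=200, rng_seed=0):
    random.seed(rng_seed)
    archive = {}  # (risk, style) -> (score, prompt): one elite per cell
    for _ in range(iterations):
        risk = random.choice(RISK)
        style = random.choice(STYLE)
        # Mutate a random existing elite, or the seed if the archive is empty.
        if archive:
            parent = random.choice(list(archive.values()))[1]
        else:
            parent = seed_prompt
        child = mutate(parent, risk, style)
        score = attack_score(child)
        # Replace the cell's elite only if the child scores higher.
        cell = (risk, style)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive

archive = rainbow_teaming("placeholder seed prompt")
print(len(archive))  # at most len(RISK) * len(STYLE) cells
```

The output of such a loop is exactly the "diverse collection of adversarial prompts" the abstract refers to: the archive spans the feature grid (diversity) while each cell holds its highest-scoring prompt (quality).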