Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
February 26, 2024
Authors: Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
cs.AI
Abstract
As large language models (LLMs) become increasingly prevalent across many
real-world applications, understanding and enhancing their robustness to user
inputs is of paramount importance. Existing methods for identifying adversarial
prompts tend to focus on specific domains, lack diversity, or require extensive
human annotations. To address these limitations, we present Rainbow Teaming, a
novel approach for producing a diverse collection of adversarial prompts.
Rainbow Teaming casts adversarial prompt generation as a quality-diversity
problem, and uses open-ended search to generate prompts that are both effective
and diverse. It can uncover a model's vulnerabilities across a broad range of
domains including, in this paper, safety, question answering, and
cybersecurity. We also demonstrate that fine-tuning on synthetic data generated
by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting
their general capabilities and helpfulness, paving the path to open-ended
self-improvement.
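The quality-diversity framing described above can be illustrated with a minimal MAP-Elites-style loop: an archive keeps one elite prompt per feature cell, and each iteration mutates an existing elite and keeps the child only if it beats the incumbent in its cell. This is a hedged sketch under stated assumptions, not the paper's implementation: the feature axes, and the `mutate` and `attack_score` functions, are illustrative stand-ins for what Rainbow Teaming delegates to LLM mutator and judge models.

```python
import random

# Example feature axes; Rainbow Teaming's actual descriptors (e.g. risk
# category, attack style) are defined per domain in the paper.
RISK = ["fraud", "violence", "privacy"]
STYLE = ["role play", "misspellings", "slang"]

def mutate(prompt, risk, style):
    # Stand-in for an LLM mutator that steers the prompt toward the
    # target cell; here we just tag the prompt with the cell labels.
    return f"[{risk}/{style}] {prompt}"

def attack_score(prompt):
    # Stand-in for an LLM judge rating attack effectiveness in [0, 1].
    return random.random()

def rainbow_teaming(seed_prompt, iterations=200, rng_seed=0):
    random.seed(rng_seed)
    archive = {}  # (risk, style) -> (score, prompt): one elite per cell
    for _ in range(iterations):
        risk = random.choice(RISK)
        style = random.choice(STYLE)
        # Mutate a random existing elite, or the seed if the archive is empty.
        if archive:
            parent = random.choice(list(archive.values()))[1]
        else:
            parent = seed_prompt
        child = mutate(parent, risk, style)
        score = attack_score(child)
        # Replace the cell's elite only if the child scores higher.
        cell = (risk, style)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive

archive = rainbow_teaming("placeholder seed prompt")
print(len(archive))  # at most len(RISK) * len(STYLE) cells
```

The output of such a loop is exactly the "diverse collection of adversarial prompts" the abstract refers to: the archive spans the feature grid (diversity) while each cell holds its highest-scoring prompt (quality).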