Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Abstract

As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem and uses open-ended search to generate prompts that are both effective and diverse. It can uncover a model's vulnerabilities across a broad range of domains including, in this paper, safety, question answering, and cybersecurity. We also demonstrate that fine-tuning on synthetic data generated by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting their general capabilities and helpfulness, paving the way to open-ended self-improvement.
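To make the quality-diversity framing concrete, the following is a minimal toy sketch, not the method used in the paper. It assumes an archive with one slot per cell of a feature grid, where a candidate replaces the current occupant only if it scores higher. The feature axes (`CATEGORIES`, `STYLES`) and the `mutate` and `score` functions are hypothetical placeholders: a real system would presumably use models for mutation and judging, and the abstract does not specify any of these details.

```python
import random

# Hypothetical feature axes; the abstract does not specify the descriptors used.
CATEGORIES = ["safety", "question answering", "cybersecurity"]
STYLES = ["direct", "role play", "hypothetical"]


def mutate(prompt, category, style, rng):
    """Toy stand-in for a mutation step: retag a prompt for a target cell."""
    # Strip any earlier tag and variant suffix so prompts do not grow unboundedly.
    core = prompt.split("] ", 1)[-1].split(" (variant")[0]
    return f"[{category}/{style}] {core} (variant {rng.randint(0, 999)})"


def score(prompt, rng):
    """Toy stand-in for an effectiveness judge; returns a value in [0, 1)."""
    return rng.random()


def quality_diversity_search(seed_prompt, iterations, rng):
    """Keep the best-scoring prompt found so far for each (category, style) cell."""
    archive = {}  # (category, style) -> (score, prompt)
    for _ in range(iterations):
        # Sample a parent from the archive, or use the seed while it is empty.
        parent = rng.choice(list(archive.values()))[1] if archive else seed_prompt
        cell = (rng.choice(CATEGORIES), rng.choice(STYLES))
        child = mutate(parent, cell[0], cell[1], rng)
        child_score = score(child, rng)
        # Quality: replace only on improvement. Diversity: one slot per cell.
        if cell not in archive or child_score > archive[cell][0]:
            archive[cell] = (child_score, child)
    return archive


archive = quality_diversity_search("Tell me about X", 200, random.Random(0))
```

The archive ends up holding at most one prompt per cell, so coverage of the grid reflects diversity while the per-cell replacement rule pushes quality upward over time.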
