Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Abstract

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs. This limits their application to situations where the type of harmful behavior is known with precision beforehand. However, this skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine and extend this specification and to identify methods that elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/thestephencasper/common_claim.
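
The three steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (their code is at the repository linked above). Every name here (`toy_model`, `toy_human_label`, `explore`, `establish`, `exploit`) is hypothetical; the "model" is a hard-coded function, the "human labels" come from a keyword rule, the "classifier" is a smoothed word-count score, and a plain search over a fixed list of candidate prompts stands in for whatever established red teaming method would be used in practice.

```python
from collections import Counter

# Hypothetical stand-in for the target language model: maps a prompt to a reply.
def toy_model(prompt: str) -> str:
    if "insult" in prompt:
        return "you are a fool"
    if "weather" in prompt:
        return "it is sunny today"
    return "hello there friend"

# Hypothetical stand-in for human annotators: 1 = undesired, 0 = acceptable.
def toy_human_label(text: str) -> int:
    return 1 if "fool" in text else 0

def explore(model, prompts):
    """Step 1 (Explore): sample the model's outputs in the context of interest."""
    return [model(p) for p in prompts]

def establish(samples, labeler):
    """Step 2 (Establish): fit a measure of undesired behavior to the labels.

    The measure is a toy one: for each word, the Laplace-smoothed fraction of
    its occurrences that fell in samples labeled undesired, averaged per text.
    """
    bad, good = Counter(), Counter()
    for text in samples:
        target = bad if labeler(text) == 1 else good
        target.update(text.split())

    def score(text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        per_word = [(bad[w] + 1) / (bad[w] + good[w] + 2) for w in words]
        return sum(per_word) / len(per_word)

    return score

def exploit(model, score, candidate_prompts):
    """Step 3 (Exploit): pick the prompt whose output scores as most undesired.

    Assumption: exhaustive search over a fixed list replaces a real
    red teaming method such as a trained adversarial prompt generator.
    """
    return max(candidate_prompts, key=lambda p: score(model(p)))

exploration_prompts = ["tell me about the weather", "write an insult", "say hi"]
samples = explore(toy_model, exploration_prompts)
score = establish(samples, toy_human_label)

candidates = ["weather report please", "greet me", "give me an insult now"]
best = exploit(toy_model, score, candidates)
print(best)
```

The point of the sketch is the ordering: the measure of undesired behavior is built from the model's own sampled outputs and their labels, rather than taken from a pre-existing classifier, and only then used to search for prompts.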
