

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

June 15, 2023
作者: Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
cs.AI

Abstract

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs. This limits their application to situations where the type of harmful behavior is known with precision beforehand. However, this skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine/extend this specification and identify methods to elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/thestephencasper/common_claim.
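The three-step framework reads naturally as a pipeline. The sketch below is a minimal, hypothetical Python illustration of that loop, not the authors' implementation (see the linked repository for that): the function names, the dummy model, the keyword-based stand-in for the learned classifier, and the ranking-based stand-in for the prompt search are all assumptions made for clarity.

```python
"""Minimal sketch of the Explore / Establish / Exploit loop (illustrative only)."""
import random
from typing import Callable, List, Tuple


def explore(model: Callable[[str], str], seed_prompts: List[str],
            n_samples: int) -> List[Tuple[str, str]]:
    """Step 1: sample the model's behavior in the target context,
    collecting (prompt, completion) pairs for later human labeling."""
    return [(p, model(p)) for p in random.choices(seed_prompts, k=n_samples)]


def establish(labeled: List[Tuple[str, int]]) -> Callable[[str], float]:
    """Step 2: build a measurement of the undesired behavior from human
    labels (1 = undesired, 0 = acceptable). A toy keyword scorer stands in
    here for the trained classifier described in the paper."""
    bad = {w for text, y in labeled if y == 1 for w in text.lower().split()}
    good = {w for text, y in labeled if y == 0 for w in text.lower().split()}
    flagged = bad - good

    def score(text: str) -> float:
        words = text.lower().split()
        return sum(w in flagged for w in words) / max(len(words), 1)

    return score


def exploit(model: Callable[[str], str], score: Callable[[str], float],
            candidates: List[str], top_k: int = 5) -> List[str]:
    """Step 3: search for prompts whose completions score highest under the
    established measure (a stand-in for a full red-teaming prompt search)."""
    return sorted(candidates, key=lambda p: score(model(p)), reverse=True)[:top_k]


if __name__ == "__main__":
    def dummy_model(prompt: str) -> str:
        # Placeholder for GPT-2/GPT-3 generation.
        return prompt + " and that claim is certainly false"

    seeds = ["Tell me about the moon.", "Describe how vaccines work."]
    samples = explore(dummy_model, seeds, n_samples=4)
    # In the paper the labels come from human subjects; here they are hard-coded.
    labeled = [(completion, 1) for _, completion in samples]
    labeled.append(("the moon orbits the earth", 0))
    measure = establish(labeled)
    print(exploit(dummy_model, measure, seeds, top_k=1))
```

In the paper itself, Step 2 uses human evaluations (e.g., the CommonClaim labels) to train a real classifier, and Step 3 applies an established automated red-teaming method guided by that classifier; the sketch only preserves the shape of that loop.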