Explore, Establish, Exploit: Red Teaming Language Models from Scratch
June 15, 2023
Authors: Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
cs.AI
Abstract
Deploying large language models (LLMs) can pose hazards from harmful outputs
such as toxic or dishonest speech. Prior work has introduced tools that elicit
harmful outputs in order to identify and mitigate these risks. While this is a
valuable step toward securing language models, these approaches typically rely
on a pre-existing classifier for undesired outputs. This limits their
application to situations where the type of harmful behavior is known with
precision beforehand. However, this skips a central challenge of red teaming:
developing a contextual understanding of the behaviors that a model can
exhibit. Furthermore, when such a classifier already exists, red teaming has
limited marginal value because the classifier could simply be used to filter
training data or model outputs. In this work, we consider red teaming under the
assumption that the adversary is working from a high-level, abstract
specification of undesired behavior. The red team is expected to refine/extend
this specification and identify methods to elicit this behavior from the model.
Our red teaming framework consists of three steps: 1) Exploring the model's
behavior in the desired context; 2) Establishing a measurement of undesired
behavior (e.g., a classifier trained to reflect human evaluations); and 3)
Exploiting the model's flaws using this measure and an established red teaming
methodology. We apply this approach to red team GPT-2 and GPT-3 models to
systematically discover classes of prompts that elicit toxic and dishonest
statements. In doing so, we also construct and release the CommonClaim dataset
of 20,000 statements that have been labeled by human subjects as
common-knowledge-true, common-knowledge-false, or neither. Code is available at
https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim
is available at https://github.com/thestephencasper/common_claim.
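To make the three-step framework concrete, the following is a minimal Python sketch of how an Explore, Establish, Exploit pipeline could fit together. The names target_model, human_label, seed_prompts, and candidate_prompts are illustrative placeholders, not identifiers from the paper's released code, and the exploit step uses simple candidate scoring rather than the learned adversarial prompting the authors employ.

```python
# Illustrative sketch of the Explore / Establish / Exploit loop described in the
# abstract. All helper names are hypothetical placeholders for this example.

import random
from typing import Callable, List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def explore(target_model: Callable[[str], str],
            seed_prompts: List[str],
            n_samples: int = 200) -> List[str]:
    """Step 1 (Explore): sample the model's outputs in the context of interest."""
    return [target_model(random.choice(seed_prompts)) for _ in range(n_samples)]


def establish(outputs: List[str], human_label: Callable[[str], int]):
    """Step 2 (Establish): turn human judgments of the sampled outputs into a
    reusable classifier that approximates the undesired-behavior measure."""
    labels = [human_label(o) for o in outputs]  # 1 = undesired, 0 = acceptable
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(outputs, labels)
    return clf


def exploit(target_model: Callable[[str], str], clf,
            candidate_prompts: List[str], top_k: int = 5) -> List[Tuple[str, float]]:
    """Step 3 (Exploit): score candidate prompts by how likely the classifier
    thinks their completions are to exhibit the undesired behavior, and return
    the most promising ones (a stand-in for a learned red-teaming attack)."""
    scored = []
    for prompt in candidate_prompts:
        completion = target_model(prompt)
        score = clf.predict_proba([completion])[0][1]
        scored.append((prompt, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```

In this sketch, the human-labeling step is where the red team refines the abstract specification of undesired behavior into a concrete measure; any text-generation callable and any classifier reflecting those labels could be substituted.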