Explore, Establish, Exploit: Red Teaming Language Models from Scratch
June 15, 2023
Authors: Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
cs.AI
Abstract
Deploying large language models (LLMs) can pose hazards from harmful outputs
such as toxic or dishonest speech. Prior work has introduced tools that elicit
harmful outputs in order to identify and mitigate these risks. While this is a
valuable step toward securing language models, these approaches typically rely
on a pre-existing classifier for undesired outputs. This limits their
application to situations where the type of harmful behavior is known with
precision beforehand. However, this skips a central challenge of red teaming:
developing a contextual understanding of the behaviors that a model can
exhibit. Furthermore, when such a classifier already exists, red teaming has
limited marginal value because the classifier could simply be used to filter
training data or model outputs. In this work, we consider red teaming under the
assumption that the adversary is working from a high-level, abstract
specification of undesired behavior. The red team is expected to refine/extend
this specification and identify methods to elicit this behavior from the model.
Our red teaming framework consists of three steps: 1) Exploring the model's
behavior in the desired context; 2) Establishing a measurement of undesired
behavior (e.g., a classifier trained to reflect human evaluations); and 3)
Exploiting the model's flaws using this measure and an established red teaming
methodology. We apply this approach to red team GPT-2 and GPT-3 models to
systematically discover classes of prompts that elicit toxic and dishonest
statements. In doing so, we also construct and release the CommonClaim dataset
of 20,000 statements that have been labeled by human subjects as
common-knowledge-true, common-knowledge-false, or neither. Code is available at
https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim
is available at https://github.com/thestephencasper/common_claim.
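To make the three-step framework concrete, the following is a minimal Python sketch of how an Explore, Establish, Exploit pipeline could fit together. The names target_model, human_label, seed_prompts, and candidate_prompts are illustrative placeholders, not identifiers from the paper's released code, and the exploit step uses simple candidate scoring rather than the learned adversarial prompting the authors employ.

```python
# Illustrative sketch of the Explore / Establish / Exploit loop described in the
# abstract. All helper names are hypothetical placeholders for this example.

import random
from typing import Callable, List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def explore(target_model: Callable[[str], str],
            seed_prompts: List[str],
            n_samples: int = 200) -> List[str]:
    """Step 1 (Explore): sample the model's outputs in the context of interest."""
    return [target_model(random.choice(seed_prompts)) for _ in range(n_samples)]


def establish(outputs: List[str], human_label: Callable[[str], int]):
    """Step 2 (Establish): turn human judgments of the sampled outputs into a
    reusable classifier that approximates the undesired-behavior measure."""
    labels = [human_label(o) for o in outputs]  # 1 = undesired, 0 = acceptable
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(outputs, labels)
    return clf


def exploit(target_model: Callable[[str], str], clf,
            candidate_prompts: List[str], top_k: int = 5) -> List[Tuple[str, float]]:
    """Step 3 (Exploit): score candidate prompts by how likely the classifier
    thinks their completions are to exhibit the undesired behavior, and return
    the most promising ones (a stand-in for a learned red-teaming attack)."""
    scored = []
    for prompt in candidate_prompts:
        completion = target_model(prompt)
        score = clf.predict_proba([completion])[0][1]
        scored.append((prompt, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```

In this sketch, the human-labeling step is where the red team refines the abstract specification of undesired behavior into a concrete measure; any text-generation callable and any classifier reflecting those labels could be substituted.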