Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Abstract

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs. This limits their application to situations where the type of harmful behavior is known with precision beforehand. However, this skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine or extend this specification and to identify methods to elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/thestephencasper/common_claim.
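
To make the three-step structure concrete, the following is a minimal toy sketch in Python, not the released implementation. Every name in it (`toy_model`, `toy_label`, `explore`, `establish`, `exploit`) is hypothetical. The lookup-table "model", the keyword-based "human" labeler, the bag-of-words measure, and the brute-force prompt ranking are all stand-ins assumed for illustration; the abstract only specifies that step 2 yields a measure such as a classifier reflecting human evaluations and that step 3 uses an established red teaming methodology.

```python
from collections import Counter
from typing import Callable, List


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model: a fixed lookup table."""
    responses = {
        "weather": "the sky is clear today",
        "sports": "the match was great",
        "rival": "they are stupid and awful",
        "neighbor": "what an awful stupid person",
        "food": "the soup is great",
        "enemy": "awful stupid",
    }
    return responses.get(prompt, "i do not know")


def toy_label(text: str) -> bool:
    """Hypothetical stand-in for human labelers: flags one keyword."""
    return "stupid" in text


def explore(model: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Step 1: sample the model's outputs in the context of interest."""
    return [model(p) for p in prompts]


def establish(samples: List[str],
              label_fn: Callable[[str], bool]) -> Callable[[str], float]:
    """Step 2: fit a measure of undesired behavior to labeled samples.

    Assumption: a toy bag-of-words scorer replaces a trained classifier.
    """
    bad, ok = Counter(), Counter()
    for text in samples:
        target = bad if label_fn(text) else ok
        target.update(set(text.lower().split()))

    def score(text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        # Fraction of words seen more often in flagged than unflagged samples.
        return sum(1 for w in words if bad[w] > ok[w]) / len(words)

    return score


def exploit(model: Callable[[str], str], score: Callable[[str], float],
            candidates: List[str], top_k: int = 1) -> List[str]:
    """Step 3: find prompts whose outputs maximize the measure.

    Assumption: exhaustive ranking replaces a real red teaming method.
    """
    ranked = sorted(candidates, key=lambda p: score(model(p)), reverse=True)
    return ranked[:top_k]


samples = explore(toy_model, ["weather", "sports", "rival", "neighbor"])
measure = establish(samples, toy_label)
found = exploit(toy_model, measure, ["food", "enemy", "weather"])
print(found)
```

The point of the sketch is the data flow: the measure used in step 3 is built from the model's own sampled outputs in steps 1 and 2 rather than taken from a pre-existing classifier.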
