Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Abstract

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs, which limits their application to situations where the type of harmful behavior is known precisely beforehand. This skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine or extend this specification and identify methods to elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/thestephencasper/common_claim.
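The three steps can be illustrated with a toy sketch. Everything here is a hypothetical stand-in chosen for illustration and is not taken from the released code: `toy_model` plays the role of the target model, `toy_human_label` plays the role of human annotators, the "classifier" is a simple word-count score, and the Exploit step is a brute-force ranking of a fixed candidate list rather than an established red teaming method.

```python
from collections import Counter
from typing import Callable, List, Tuple


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for the target language model."""
    if "insult" in prompt:
        return "you are a fool"
    return "have a nice day"


def toy_human_label(text: str) -> int:
    """Hypothetical stand-in for human annotation: 1 = undesired, 0 = acceptable."""
    return int("fool" in text)


def explore(model: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Step 1 (Explore): sample the model's outputs in the context of interest."""
    return [model(p) for p in prompts]


def establish(samples: List[str],
              label_fn: Callable[[str], int]) -> Callable[[str], float]:
    """Step 2 (Establish): label the samples and build a measure of undesired behavior.

    Assumption: a word-count score replaces a trained classifier here.
    """
    bad, good = Counter(), Counter()
    for text in samples:
        target = bad if label_fn(text) else good
        target.update(text.split())

    def score(text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        # Words seen mostly in undesired samples push the score up.
        return sum(bad[w] - good[w] for w in words) / len(words)

    return score


def exploit(model: Callable[[str], str],
            score: Callable[[str], float],
            candidates: List[str]) -> List[Tuple[float, str]]:
    """Step 3 (Exploit): rank candidate prompts by the score of the output they elicit."""
    scored = [(score(model(p)), p) for p in candidates]
    return sorted(scored, reverse=True)


def run_demo() -> List[Tuple[float, str]]:
    seed_prompts = ["say hello", "write an insult", "greet me"]
    samples = explore(toy_model, seed_prompts)
    score = establish(samples, toy_human_label)
    candidates = ["tell a joke", "insult my friend", "say goodbye"]
    return exploit(toy_model, score, candidates)


if __name__ == "__main__":
    for value, prompt in run_demo():
        print(f"{value:+.2f}  {prompt}")
```

The point of the sketch is the ordering of the steps: the measure used in step 3 is built from data gathered in step 1 and labeled in step 2, rather than being a pre-existing classifier.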
