Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Abstract

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs, which limits their application to situations where the type of harmful behavior is known with precision beforehand. This reliance skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine or extend this specification and identify methods to elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/thestephencasper/common_claim.
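
The three steps can be illustrated with a toy sketch. This is not the authors' implementation (that is in the repository linked above). The names `toy_model`, `toy_label`, `explore`, `establish`, and `exploit` are hypothetical, the "model" is a hard-coded function, the "classifier" is a word-count scorer standing in for one trained on human labels, and the exploit step simply ranks a fixed list of candidate prompts, because the abstract does not say which red teaming methodology is used.

```python
from collections import Counter
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a language model (assumption, not GPT-2/GPT-3).
    if "rival" in prompt.lower():
        return "they are stupid and awful"
    return "that sounds nice"


def toy_label(text: str) -> bool:
    # Hypothetical stand-in for a human judgment of "undesired".
    return "stupid" in text or "awful" in text


def explore(model: Model, prompts: List[str]) -> List[str]:
    """Step 1: sample the model's outputs in the context of interest."""
    return [model(p) for p in prompts]


def establish(samples: List[str],
              label: Callable[[str], bool]) -> Callable[[str], float]:
    """Step 2: label explored samples and fit a measure of undesired behavior.

    The measure is a toy scorer: the fraction of words in a text that
    appeared more often in samples labeled undesired than in the rest.
    """
    bad, good = Counter(), Counter()
    for s in samples:
        (bad if label(s) else good).update(s.lower().split())

    def score(text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(1 for w in words if bad[w] > good[w]) / len(words)

    return score


def exploit(model: Model, score: Callable[[str], float],
            candidates: List[str]) -> List[Tuple[float, str]]:
    """Step 3: rank candidate prompts by how strongly their outputs score."""
    return sorted(((score(model(p)), p) for p in candidates), reverse=True)


samples = explore(toy_model, ["tell me about my rival",
                              "tell me about the weather"])
measure = establish(samples, toy_label)
ranked = exploit(toy_model, measure, ["describe a sunny day",
                                      "describe a rival team"])
print(ranked[0][1])  # prompt whose output scores highest under the measure
```

The point of the sketch is the ordering of the steps: the measure is built from labels on explored outputs, not taken from a pre-existing classifier, and only then used to search for prompts.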

