Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Abstract

As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem and uses open-ended search to generate prompts that are both effective and diverse. It can uncover a model's vulnerabilities across a broad range of domains including, in this paper, safety, question answering, and cybersecurity. We also demonstrate that fine-tuning on synthetic data generated by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting their general capabilities and helpfulness, paving the path to open-ended self-improvement.
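
To make the quality-diversity framing concrete, the following is a minimal sketch of a MAP-Elites-style archive loop, one common way such a search is implemented. The abstract does not specify the algorithm, so this choice is an assumption. Everything in it is hypothetical and deliberately toy-sized: candidates are pairs of numbers rather than prompts, and `toy_mutate`, `toy_descriptor`, and `toy_fitness` stand in for whatever mutation, feature-labelling, and scoring components a real system would use.

```python
import random


def map_elites(seed, mutate, descriptor, fitness, iterations, rng):
    """Toy quality-diversity loop: keep the best candidate seen per descriptor cell."""
    archive = {descriptor(seed): (fitness(seed), seed)}
    for _ in range(iterations):
        # Sample a parent from any occupied cell, then perturb it.
        _, parent = archive[rng.choice(sorted(archive))]
        child = mutate(parent, rng)
        cell, score = descriptor(child), fitness(child)
        # Insert if the cell is empty or the child beats the incumbent.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


# Hypothetical stand-ins: a candidate is a point (x, y) in the unit square.
def toy_mutate(candidate, rng):
    return tuple(min(1.0, max(0.0, v + rng.gauss(0, 0.2))) for v in candidate)


def toy_descriptor(candidate):
    # Diversity axis: which cell of a 4x4 grid the point falls in.
    x, y = candidate
    return (min(int(x * 4), 3), min(int(y * 4), 3))


def toy_fitness(candidate):
    # Quality axis: an arbitrary score to maximise within each cell.
    x, y = candidate
    return x + y


archive = map_elites(
    (0.5, 0.5), toy_mutate, toy_descriptor, toy_fitness, 500, random.Random(0)
)
print(f"{len(archive)} of 16 cells filled")
```

The point of the design is that candidates compete only within their own cell, so the search returns a collection that is both high-scoring and spread across the descriptor space, rather than a single best candidate.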
