Unveiling Safety Vulnerabilities of Large Language Models

Abstract

As large language models become more prevalent, the harmful or inappropriate responses they may produce are a growing cause for concern. This paper introduces AttaQ, a unique dataset of adversarial examples in the form of questions designed to provoke such harmful or inappropriate responses. We assess the efficacy of the dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions, that is, input semantic areas for which the model is likely to produce harmful outputs. This is achieved by applying specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses and facilitates targeted improvements to the model's safety mechanisms and overall reliability.
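The abstract does not say which clustering technique is used, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes each attack question already has an embedding vector (from any sentence encoder) and each model response already has a harmfulness score in [0, 1], with higher meaning more harmful. The function name `vulnerable_regions`, the thresholds, and the toy data are all hypothetical. Attacks are grouped only if their response was judged harmful and they are semantically close to an existing group's leader.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def vulnerable_regions(embeddings, harm, sim_threshold=0.8, harm_threshold=0.5):
    """Group attacks that are semantically close AND drew harmful responses.

    embeddings: one vector per attack question (assumed precomputed)
    harm: one score per model response in [0, 1], higher = more harmful (assumed)
    Returns a list of clusters (lists of attack indices), most harmful first.
    """
    # Keep only attacks whose response was judged harmful.
    harmful = [i for i, h in enumerate(harm) if h >= harm_threshold]
    # Visit the most harmful attacks first so they become cluster leaders.
    harmful.sort(key=lambda i: harm[i], reverse=True)

    clusters = []  # each cluster is a list of indices; element 0 is the leader
    for i in harmful:
        for cluster in clusters:
            if cosine(embeddings[i], embeddings[cluster[0]]) >= sim_threshold:
                cluster.append(i)
                break
        else:
            # No existing leader is similar enough: start a new region.
            clusters.append([i])

    # Rank regions by the mean harmfulness of their members.
    clusters.sort(key=lambda c: sum(harm[j] for j in c) / len(c), reverse=True)
    return clusters


# Toy data: attacks 0-1 point one way, 2-3 another; attack 4 drew a safe response.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
harm = [0.9, 0.8, 0.7, 0.6, 0.1]
regions = vulnerable_regions(embeddings, harm)
```

On the toy data, attack 4 is excluded even though it lies close to attacks 0 and 1, because its response was not harmful. This is the sense in which the grouping depends on both signals. Naming each region, which the abstract also mentions, is outside the scope of this sketch.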
