Unveiling Safety Vulnerabilities of Large Language Models

Abstract

As large language models become more prevalent, their potential to produce harmful or inappropriate responses is a cause for concern. This paper introduces AttaQ, a unique dataset of adversarial examples in the form of questions, designed to provoke such harmful or inappropriate responses. We assess the efficacy of the dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions, that is, input semantic areas for which the model is likely to produce harmful outputs. This is achieved by applying specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses and facilitates targeted improvements to a model's safety mechanisms and overall reliability.
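The idea of grouping attacks by semantic similarity and then ranking the groups by response harmfulness can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the clustering technique, so a simple greedy cosine-similarity grouping is swapped in here. The function name `vulnerable_regions`, both thresholds, and the convention that harm scores lie in [0, 1] with higher meaning more harmful are all assumptions made for illustration. The naming of regions is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vulnerable_regions(embeddings, harm_scores,
                       sim_threshold=0.8, harm_threshold=0.5):
    """Hypothetical helper: group attacks by similarity, keep harmful groups.

    embeddings  -- one vector per attack question (assumed precomputed)
    harm_scores -- one score per model response; assumed in [0, 1],
                   higher meaning more harmful
    Returns a list of (member_indices, mean_harm), most harmful first.
    """
    # Greedy grouping: an attack joins the first cluster whose seed
    # (its first member) is similar enough, else it starts a new cluster.
    clusters = []
    for i, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(emb, embeddings[members[0]]) >= sim_threshold:
                members.append(i)
                break
        else:
            clusters.append([i])

    # Keep only clusters whose average response harmfulness is high.
    regions = []
    for members in clusters:
        mean_harm = sum(harm_scores[i] for i in members) / len(members)
        if mean_harm >= harm_threshold:
            regions.append((members, mean_harm))
    return sorted(regions, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    # Toy data: two attacks pointing one way, two pointing another.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    toy_harm = [0.9, 0.8, 0.1, 0.2]
    for members, mean_harm in vulnerable_regions(toy_embeddings, toy_harm):
        print(members, round(mean_harm, 2))
```

With the toy data, only the first pair of attacks forms a region whose average harm exceeds the threshold. A real pipeline would replace the toy vectors with sentence embeddings of the attack questions and the toy scores with the output of a harmfulness classifier; both components are left unspecified here because the abstract does not name them.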
