揭示大型語言模型的安全漏洞

摘要

隨著大型語言模型變得更加普及，其可能帶來有害或不當回應的問題引起了關注。本文介紹了一個獨特的數據集，其中包含以問題形式的對抗性示例，我們稱之為AttaQ，旨在引發此類有害或不當回應。我們通過分析各種模型在受到此數據集影響時的弱點來評估我們數據集的有效性。此外，我們引入了一種新穎的自動方法，用於識別和命名易受攻擊的語義區域 - 模型可能會產生有害輸出的輸入語義區域。這是通過應用專門的聚類技術實現的，該技術考慮了輸入攻擊的語義相似性和模型回應的有害性。自動識別易受攻擊的語義區域有助於評估模型的弱點，促進針對性地改進其安全機制和整體可靠性。

English

As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.

揭示大型語言模型的安全漏洞

Unveiling Safety Vulnerabilities of Large Language Models

摘要

Support