揭示大型語言模型的安全漏洞
Unveiling Safety Vulnerabilities of Large Language Models
November 7, 2023
作者: George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, Eitan Farchi
cs.AI
摘要
隨著大型語言模型變得更加普及,其可能帶來有害或不當回應的問題引起了關注。本文介紹了一個獨特的數據集,其中包含以問題形式的對抗性示例,我們稱之為AttaQ,旨在引發此類有害或不當回應。我們通過分析各種模型在受到此數據集影響時的弱點來評估我們數據集的有效性。此外,我們引入了一種新穎的自動方法,用於識別和命名易受攻擊的語義區域 - 模型可能會產生有害輸出的輸入語義區域。這是通過應用專門的聚類技術實現的,該技術考慮了輸入攻擊的語義相似性和模型回應的有害性。自動識別易受攻擊的語義區域有助於評估模型的弱點,促進針對性地改進其安全機制和整體可靠性。
English
As large language models become more prevalent, their possible harmful or
inappropriate responses are a cause for concern. This paper introduces a unique
dataset containing adversarial examples in the form of questions, which we call
AttaQ, designed to provoke such harmful or inappropriate responses. We assess
the efficacy of our dataset by analyzing the vulnerabilities of various models
when subjected to it. Additionally, we introduce a novel automatic approach for
identifying and naming vulnerable semantic regions - input semantic areas for
which the model is likely to produce harmful outputs. This is achieved through
the application of specialized clustering techniques that consider both the
semantic similarity of the input attacks and the harmfulness of the model's
responses. Automatically identifying vulnerable semantic regions enhances the
evaluation of model weaknesses, facilitating targeted improvements to its
safety mechanisms and overall reliability.