Unveiling Safety Vulnerabilities of Large Language Models

Abstract

As large language models become more prevalent, their potential to produce harmful or inappropriate responses is a cause for concern. This paper introduces AttaQ, a unique dataset of adversarial examples in the form of questions, designed to provoke such responses. We assess the efficacy of the dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions, that is, input semantic areas for which the model is likely to produce harmful outputs. This is achieved by applying specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses and facilitates targeted improvements to the model's safety mechanisms and overall reliability.
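The idea of combining semantic similarity with response harmfulness can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the greedy single-pass grouping, the function name `vulnerable_regions`, the thresholds, and the toy 2-D "embeddings" and harm scores are all assumptions made purely for illustration. In practice the embeddings would come from a sentence encoder and the harm scores from a separate evaluator of the model's responses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vulnerable_regions(embeddings, harm_scores,
                       sim_threshold=0.9, harm_threshold=0.5, min_size=2):
    """Hypothetical helper: group similar attacks, keep the harmful groups.

    Step 1 greedily assigns each attack to the first cluster whose seed
    (first member) is at least `sim_threshold` similar to it.
    Step 2 keeps clusters with at least `min_size` members whose mean
    harm score is at least `harm_threshold`.
    Returns (member_indices, mean_harm) pairs, most harmful first.
    """
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(emb, embeddings[cluster[0]]) >= sim_threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar seed found: start a new cluster

    regions = []
    for cluster in clusters:
        mean_harm = sum(harm_scores[i] for i in cluster) / len(cluster)
        if len(cluster) >= min_size and mean_harm >= harm_threshold:
            regions.append((cluster, mean_harm))
    return sorted(regions, key=lambda r: r[1], reverse=True)


# Toy data (assumed): five attacks with 2-D embeddings and harm scores in [0, 1].
embeddings = [(1.0, 0.0), (0.98, 0.05), (0.0, 1.0), (0.05, 0.99), (0.7, 0.7)]
harm_scores = [0.9, 0.8, 0.1, 0.2, 0.6]

regions = vulnerable_regions(embeddings, harm_scores)
for members, mean_harm in regions:
    print(f"attacks {members}: mean harm {mean_harm:.2f}")
```

In this toy run only the first two attacks form a region: they are close to each other in embedding space and both elicit high harm scores. The second pair is equally tight but elicits low harm scores, and the last attack has no neighbors, so neither is reported.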
