LLMには政治的正確性がありますか？AIシステムにおける倫理的バイアスとジェイルブレイクの脆弱性を分析する

要旨

大規模言語モデル（LLM）は、さまざまなタスクで印象的な能力を示していますが、`ジェイルブレイク'などの潜在的な安全リスクがあります。悪意のある入力によってLLMが有害なコンテンツを生成するよう強制される可能性があります。これらの問題に対処するため、多くのLLM開発者が、これらのモデルを整列させるためにさまざまな安全対策を実装しています。この整列には、事前トレーニング中のデータフィルタリング、監督されたファインチューニング、人間からのフィードバックによる強化学習、およびレッドチーム演習など、いくつかの技術が関与しています。これらの方法は、しばしば倫理的な行動を確保するために、政治的正しさ（PC）に類似した意図的なバイアスを導入します。本論文では、安全性のためにLLMに注入される意図的なバイアスに焦点を当て、これらの安全整列技術を回避する方法を検討します。特に、これらの意図的なバイアスは、GPT-4oモデルにおいて、非バイナリとシスジェンダーキーワード間で20%、白人と黒人キーワード間で16%のジェイルブレイク成功率の違いをもたらします。他のプロンプトの部分が同一である場合でもです。我々は、PCJailbreakという概念を導入し、これらの安全性に起因するバイアスがもたらす固有のリスクを強調します。さらに、生成前に防御プロンプトを注入することでジェイルブレイクの試みを防ぐ効率的な防御方法PCDefenseを提案します。PCDefenseは、テキスト生成後に追加の推論コストが必要なLlama-Guardなどのガードモデルにとって魅力的な代替手段となります。我々の調査結果は、LLM開発者が安全対策の設計と実装においてより責任あるアプローチを採用する必要性を強調しています。

English

Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate and intentional biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense method PCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.

LLMには政治的正確性がありますか？AIシステムにおける倫理的バイアスとジェイルブレイクの脆弱性を分析する

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

要旨

Support