Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
October 17, 2024
Authors: Isack Lee, Haebin Seong
cs.AI
Abstract
Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks such as "jailbreaks", where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate biases, similar to Political Correctness (PC), to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in jailbreaking success rates in GPT-4o models that differ by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the rest of the prompt is identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose PCDefense, an efficient defense method that prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense is an appealing alternative to Guard Models, such as Llama-Guard, which require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.