
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

April 10, 2026
Authors: Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov
cs.AI

Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm-generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights underlying harmful capabilities are compressed, fine-tuning that engages those weights in one domain can trigger broad misalignment. Consistent with this, pruning harm-generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' capability to generate harmful content is dissociated from their ability to recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
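The abstract names targeted weight pruning as the causal intervention but does not spell out the procedure. Below is a minimal sketch of how such an intervention is commonly set up: score each weight by a first-order attribution (|weight × gradient| on a probe batch), then zero the top-scoring fraction. The toy model, the scoring rule, and the helper names `importance_scores` and `prune_top_fraction` are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM; real experiments would target transformer weights.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

def importance_scores(model, inputs, targets, loss_fn):
    """Score each weight by |weight * gradient| on a probe batch.

    A common first-order attribution heuristic; the paper's exact
    scoring rule is not specified in the abstract.
    """
    model.zero_grad(set_to_none=True)
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return {
        name: (p.detach() * p.grad).abs()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

def prune_top_fraction(model, scores, frac=0.01):
    """Zero out the top `frac` fraction of weights by importance score."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name not in scores:
                continue
            flat = scores[name].flatten()
            k = max(1, int(frac * flat.numel()))
            threshold = flat.topk(k).values.min()
            p.mul_(scores[name] < threshold)  # keep only weights below threshold

# Usage: score weights on a batch standing in for harmful prompts, then prune.
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
scores = importance_scores(model, x, y, nn.CrossEntropyLoss())
prune_top_fraction(model, scores, frac=0.05)
```

Scoring on harm-eliciting data while leaving benign behavior intact is what makes the pruning "targeted": if the abstract's claim holds, a small fraction of weights should account for harmful generation across harm types.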