

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

April 10, 2026
Authors: Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov
cs.AI

Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
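The abstract's core intervention, targeted weight pruning as a causal probe, can be sketched as a toy experiment. This is a minimal illustration under assumed details, not the paper's implementation: the first-order attribution score, the 5% budget, and the linear "model" are all hypothetical stand-ins. The idea shown is only the shape of the method: score weights by their contribution on a probe input, zero the top-scoring compact set, and compare the effect on the probed behavior versus a control.

```python
import numpy as np

# Toy sketch of targeted weight pruning as a causal intervention.
# A linear map stands in for a model layer; x_harm stands in for an
# input eliciting the target behavior, x_benign for a control input.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))       # hypothetical weight matrix
x_harm = rng.normal(size=8)       # probe input (target behavior)
x_benign = rng.normal(size=8)     # control input (benign behavior)

# Assumed attribution score: |weight * input activation|, a simple
# first-order proxy for each weight's contribution on the probe.
scores = np.abs(W * x_harm)       # broadcasts x_harm across rows

# Zero the top 5% highest-scoring weights (the "compact set").
k = int(0.05 * W.size)
thresh = np.partition(scores.ravel(), -k)[-k]   # k-th largest score
mask = scores < thresh            # False exactly on the pruned entries
W_pruned = W * mask

# Causal readout: how much does the intervention move each output?
delta_harm = np.linalg.norm(W @ x_harm - W_pruned @ x_harm)
delta_benign = np.linalg.norm(W @ x_benign - W_pruned @ x_benign)
print(f"pruned {k} weights; Δharm={delta_harm:.3f}, Δbenign={delta_benign:.3f}")
```

In the paper's framing, a successful intervention of this kind removes the harmful capability (large change on the probe) while leaving benign capabilities largely intact (small change on the control); the finding that such a compact, general set exists is the paper's empirical contribution, not a property of this toy.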