Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
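To make the idea of targeted weight pruning concrete, the toy sketch below scores each weight of a single linear layer by how much more it matters on a "target" input set than on a "benign" one, then zeroes the most target-selective weights. The scoring rule (weight magnitude times input-activation norm), the function names, and the toy data are all assumptions chosen for illustration; the abstract does not state which scoring rule or selection procedure the authors use.

```python
import numpy as np

def activation_score(weight, inputs):
    """Hypothetical importance score: |W_ij| times the L2 norm of input feature j.

    weight: (out_features, in_features); inputs: (n_samples, in_features).
    """
    col_norm = np.linalg.norm(inputs, axis=0)   # one norm per input feature
    return np.abs(weight) * col_norm            # broadcasts over output rows

def targeted_prune(weight, target_inputs, benign_inputs, k):
    """Zero the k weights that score highest on target data relative to benign data."""
    selectivity = (activation_score(weight, target_inputs)
                   - activation_score(weight, benign_inputs))
    order = np.argsort(selectivity, axis=None)  # ascending, over the flattened matrix
    drop = order[weight.size - k:]              # indices of the k most target-selective weights
    mask = np.ones(weight.size, dtype=bool)
    mask[drop] = False
    mask = mask.reshape(weight.shape)
    return weight * mask, mask

# Toy layer: target inputs use features 2 and 3, benign inputs use features 0 and 1.
weight = np.ones((2, 4))
target_inputs = np.array([[0.0, 0.0, 1.0, 2.0],
                          [0.0, 0.0, 1.0, 2.0]])
benign_inputs = np.array([[1.0, 1.0, 0.0, 0.0],
                          [1.0, 1.0, 0.0, 0.0]])

pruned, mask = targeted_prune(weight, target_inputs, benign_inputs, k=4)
target_out = target_inputs @ pruned.T   # response to target inputs after pruning
benign_out = benign_inputs @ pruned.T   # response to benign inputs after pruning
```

In this toy setting the pruned layer no longer responds to the target inputs while its response to the benign inputs is unchanged, which is the kind of dissociation a causal pruning intervention looks for. Real models would need per-layer scoring on actual activations, and the clean separation here is an artifact of the hand-built data.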
