Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm generation weights than their unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
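The abstract does not say how the pruned weights are selected, so the following is only a minimal sketch of one way targeted pruning could work, not the authors' procedure. It assumes each weight already has two importance scores (one measured on harmful generations, one on benign tasks) and zeroes the weights that rank highly for harm but not for benign use. The function names, the set-difference rule, and the toy numbers are all hypothetical.

```python
def top_fraction(scores, fraction):
    """Return the indices of the highest-scoring `fraction` of weights."""
    k = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def targeted_prune(weights, harm_scores, benign_scores, harm_frac, benign_frac):
    """Zero weights important for harmful outputs but not for benign ones.

    Assumption: importance scores are given; how to compute them is
    outside this sketch.
    """
    harm_top = top_fraction(harm_scores, harm_frac)
    benign_top = top_fraction(benign_scores, benign_frac)
    to_prune = harm_top - benign_top  # protect weights that benign tasks rely on
    pruned = [0.0 if i in to_prune else w for i, w in enumerate(weights)]
    return pruned, to_prune


# Toy example with six weights (illustrative values only).
weights = [0.5, -1.2, 0.8, 0.3, -0.7, 1.1]
harm_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05]
benign_scores = [0.1, 0.9, 0.85, 0.3, 0.2, 0.6]

pruned, removed = targeted_prune(weights, harm_scores, benign_scores, 0.5, 0.4)
print(sorted(removed))
print(pruned)
```

In this toy case weight 2 scores highly for both harmful and benign use, so it is kept. That mirrors the abstract's claim that harm generation weights are distinct from benign capabilities: the intervention is only informative if the pruned set can be separated from weights that ordinary tasks need.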

