Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm generation weights than their unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights underlying harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
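
The abstract names targeted weight pruning as its intervention but does not spell out the procedure. The toy sketch below shows one way such an intervention might look, under loudly stated assumptions: weights are scored with a saliency of |w * gradient| on harmful and on benign data, and only weights that rank highly for harmful data but not for benign data are zeroed. The scoring rule, the set-difference selection, and every function name here are hypothetical illustrations, not the authors' method.

```python
# Toy sketch of targeted weight pruning (illustrative assumptions only).
# Weights and gradients are flat Python lists standing in for model tensors.

def importance(weights, grads):
    """Assumed saliency score: |w * dL/dw| for each weight."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def top_k_indices(scores, k):
    """Indices of the k highest-scoring weights."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def select_harm_specific(weights, harm_grads, benign_grads, k_harm, k_benign):
    """Weights salient for harmful data but not among the top benign ones."""
    harm_top = top_k_indices(importance(weights, harm_grads), k_harm)
    benign_top = top_k_indices(importance(weights, benign_grads), k_benign)
    return harm_top - benign_top


def prune(weights, indices):
    """Return a copy of the weights with the selected entries set to zero."""
    return [0.0 if i in indices else w for i, w in enumerate(weights)]


# Made-up numbers, chosen only to exercise the functions.
weights = [0.5, -1.0, 2.0, 0.1, -0.3]
harm_grads = [0.1, 2.0, 1.5, 0.0, 0.2]     # gradients of a loss on harmful data
benign_grads = [0.1, 0.1, 2.0, 0.5, 0.1]   # gradients of a loss on benign data

to_prune = select_harm_specific(weights, harm_grads, benign_grads,
                                k_harm=2, k_benign=1)
pruned = prune(weights, to_prune)
print(sorted(to_prune), pruned)
```

In this toy case the weight at index 2 is salient for both data types and is therefore kept, while index 1 is salient only for harmful data and is zeroed. That mirrors the abstract's claim that harm generation weights are distinct from benign capabilities, though a real experiment would operate on model tensors and measure downstream behavior.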
