LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Abstract

Large language models (LLMs) often demonstrate strong safety performance in high-resource languages yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between the models' language-agnostic semantic understanding and a safety alignment that is language-dominant and biased toward high-resource languages. Consistent with this hypothesis, we empirically identify a semantic bottleneck in LLMs: an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than by language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly at the semantic bottleneck. Experiments show that LASA substantially improves safety across all languages: the average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment should anchor safety understanding not in surface text but in the model's language-agnostic semantic space.
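The notion of a layer whose geometry is "governed primarily by shared semantic content rather than language identity" can be made concrete with a small sketch. The code below is illustrative only and is not the authors' procedure: the scoring rule (mean cosine similarity of same-meaning, cross-language pairs minus that of same-language, different-meaning pairs), the function names, and the synthetic data are all assumptions introduced here. In practice the per-layer arrays would come from a model's hidden states on parallel sentences.

```python
import numpy as np


def semantic_vs_language_score(layer_reps):
    """Score one layer (hypothetical metric, not taken from the paper).

    layer_reps has shape (n_langs, n_sents, dim); entry [l, s] is the
    representation of parallel sentence s written in language l.
    Returns: mean cosine similarity of same-meaning / different-language
    pairs minus that of same-language / different-meaning pairs.
    """
    n_langs, n_sents, dim = layer_reps.shape
    flat = layer_reps.reshape(n_langs * n_sents, dim)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T  # pairwise cosine similarities

    # Row index l * n_sents + s maps back to (language l, sentence s).
    lang = np.repeat(np.arange(n_langs), n_sents)
    sent = np.tile(np.arange(n_sents), n_langs)
    same_lang = lang[:, None] == lang[None, :]
    same_sent = sent[:, None] == sent[None, :]

    semantic = sim[same_sent & ~same_lang].mean()
    language = sim[same_lang & ~same_sent].mean()
    return float(semantic - language)


def find_bottleneck_layer(reps_by_layer):
    """Return the index of the highest-scoring layer and all scores."""
    scores = [semantic_vs_language_score(r) for r in reps_by_layer]
    return int(np.argmax(scores)), scores


def make_synthetic_layers(seed=0):
    """Toy stand-in for hidden states: each layer mixes a language
    vector and a meaning vector with different weights (assumed data)."""
    rng = np.random.default_rng(seed)
    n_langs, n_sents, dim = 4, 6, 32
    lang_vecs = rng.normal(size=(n_langs, 1, dim))
    sem_vecs = rng.normal(size=(1, n_sents, dim))
    # (language weight, semantic weight) per layer; the middle layer
    # is built to be dominated by meaning.
    weights = [(1.0, 0.1), (0.1, 1.0), (1.0, 0.3)]
    return [wl * lang_vecs + ws * sem_vecs for wl, ws in weights]


if __name__ == "__main__":
    layer_index, layer_scores = find_bottleneck_layer(make_synthetic_layers())
    print("candidate bottleneck layer:", layer_index)
    print("per-layer scores:", [round(s, 3) for s in layer_scores])
```

On the toy data, the middle layer is constructed to be meaning-dominated, so it should receive the highest score. With a real model, a positive score at an intermediate layer would be consistent with the bottleneck described above, but the exact criterion used in the paper may differ.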
