LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Abstract

Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between the models' language-agnostic semantic understanding and their language-dominant safety alignment, which is biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs: an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than by language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in the semantic bottleneck. Experiments show that LASA substantially improves safety across all languages: the average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment should be anchored not in surface text but in the model's language-agnostic semantic space.
