LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Abstract

Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between the model's language-agnostic semantic understanding and its language-dominant safety alignment, which is biased toward high-resource languages. Consistent with this hypothesis, we empirically identify a semantic bottleneck in LLMs: an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than by language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in semantic bottlenecks. Experiments show that LASA substantially improves safety across all languages: the average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment should anchor safety understanding not in surface text, but in the model's language-agnostic semantic space.
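
To make the notion of a semantic bottleneck more concrete, the following is a minimal sketch of one way such a layer could be located. It is not the procedure used in the paper. It assumes that per-layer hidden states are available for parallel sentences (the same meaning expressed in several languages) and scores each layer by a hypothetical gap: the mean cosine similarity of same-meaning, different-language pairs minus that of same-language, different-meaning pairs. The function names, the scoring rule, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def semantic_vs_language_gap(hidden):
    """Score one layer; `hidden` has shape (n_meanings, n_languages, dim).

    Returns mean cosine similarity of same-meaning/different-language pairs
    minus that of same-language/different-meaning pairs (an assumed metric).
    """
    m, l, d = hidden.shape
    x = hidden.reshape(m * l, d)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalise rows
    sim = x @ x.T  # pairwise cosine similarities

    meaning = np.repeat(np.arange(m), l)  # meaning id of each row
    lang = np.tile(np.arange(l), m)       # language id of each row
    same_meaning = meaning[:, None] == meaning[None, :]
    same_lang = lang[:, None] == lang[None, :]

    semantic_pairs = same_meaning & ~same_lang
    language_pairs = same_lang & ~same_meaning
    return sim[semantic_pairs].mean() - sim[language_pairs].mean()


def find_bottleneck_layer(hidden_by_layer):
    """Return the index of the layer with the largest gap, plus all gaps."""
    gaps = [semantic_vs_language_gap(h) for h in hidden_by_layer]
    return int(np.argmax(gaps)), gaps


def toy_layers(seed=0, n_meanings=6, n_languages=4, dim=32):
    """Synthetic stand-in for real hidden states (purely illustrative).

    Each representation mixes a meaning vector and a language vector;
    only the middle layer weights the meaning component more heavily.
    """
    rng = np.random.default_rng(seed)
    meaning_vecs = rng.normal(size=(n_meanings, 1, dim))
    lang_vecs = rng.normal(size=(1, n_languages, dim))
    weights = [(0.2, 1.0), (1.0, 0.2), (0.2, 1.0)]  # (meaning, language)
    return [
        a * meaning_vecs + b * lang_vecs
        + 0.05 * rng.normal(size=(n_meanings, n_languages, dim))
        for a, b in weights
    ]


if __name__ == "__main__":
    layer, gaps = find_bottleneck_layer(toy_layers())
    print("candidate bottleneck layer:", layer)
    print("per-layer gaps:", [round(float(g), 3) for g in gaps])
```

On real models, the synthetic arrays would be replaced by hidden states extracted from each transformer layer for a parallel multilingual prompt set. The choice of pooling (for example last-token or mean pooling) is a further assumption that this sketch leaves open.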
