Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
July 7, 2025
Author: A. Bochkov
cs.AI
Abstract
Understanding the locus of semantic representation in large language models
(LLMs) is crucial for interpretability and architectural innovation. The
dominant paradigm posits that trainable input embeddings serve as foundational
"meaning vectors." This paper challenges that view. We construct Transformer
models where the embedding layer is entirely frozen, with vectors derived not
from data, but from the visual structure of Unicode glyphs. These non-semantic,
precomputed visual embeddings are fixed throughout training. Our method is
compatible with any tokenizer, including a novel Unicode-centric tokenizer we
introduce to ensure universal text coverage. Despite the absence of trainable,
semantically initialized embeddings, our models converge, generate coherent
text, and, critically, outperform architecturally identical models with
trainable embeddings on the MMLU reasoning benchmark. We attribute this to
"representational interference" in conventional models, where the embedding
layer is burdened with learning both structural and semantic features. Our
results indicate that high-level semantics are not inherent to input embeddings
but are an emergent property of the Transformer's compositional architecture
and data scale. This reframes the role of embeddings from meaning containers to
structural primitives. We release all code and models to foster further
research.
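
The sketch below is not the authors' released code; it is a minimal illustration, under assumed choices (PIL glyph rendering, 16x16 rasters, a toy code-point vocabulary, and a hypothetical VisualUnicodeLM class), of the core idea the abstract describes: precompute non-semantic embeddings from rendered Unicode glyph bitmaps, freeze them, and let only the Transformer layers and output head train.

```python
# Minimal sketch (assumptions noted above): frozen visual Unicode embeddings
# feeding a standard PyTorch Transformer encoder.
import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

GLYPH_SIZE = 16                                   # assumed raster size (16x16 -> 256-dim vectors)
VOCAB = [chr(cp) for cp in range(32, 32 + 512)]   # toy Unicode code-point vocabulary

def render_glyph(ch: str, size: int = GLYPH_SIZE) -> np.ndarray:
    """Rasterize one character to a flat, normalized grayscale bitmap vector."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

# Precompute the visual embedding table once; it is never updated during training.
glyph_matrix = torch.tensor(np.stack([render_glyph(c) for c in VOCAB]))
frozen_embed = nn.Embedding.from_pretrained(glyph_matrix, freeze=True)

class VisualUnicodeLM(nn.Module):
    """Hypothetical wrapper: frozen glyph embeddings + trainable Transformer stack."""
    def __init__(self, d_model: int = GLYPH_SIZE ** 2, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = frozen_embed                      # frozen, non-semantic input layer
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, len(VOCAB))  # trainable output projection

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.encoder(self.embed(token_ids)))

model = VisualUnicodeLM()
logits = model(torch.randint(0, len(VOCAB), (2, 8)))   # (batch=2, seq=8, vocab) logits
```

In this toy setup only the encoder layers and the output head receive gradients, mirroring the paper's claim that semantics must emerge in the compositional architecture rather than in the (here structurally derived, fixed) input embeddings.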