

Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

July 7, 2025
作者: A. Bochkov
cs.AI

Abstract

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
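The central mechanism described above, an embedding table precomputed from the visual form of Unicode glyphs and held fixed during training, can be illustrated with a toy sketch. This is not the authors' released code: real implementations would rasterize each codepoint with a font renderer, whereas here hand-drawn 5x5 bitmaps (the `GLYPHS` dict and all names are illustrative assumptions) stand in for actual glyph renderings.

```python
# Toy sketch of frozen visual embeddings, assuming tiny hand-drawn
# bitmaps in place of real font-rasterized Unicode glyphs.

GLYPHS = {
    "A": ["..#..",
          ".#.#.",
          "#####",
          "#...#",
          "#...#"],
    "B": ["####.",
          "#...#",
          "####.",
          "#...#",
          "####."],
}

def glyph_to_vector(rows):
    """Flatten a glyph bitmap into a fixed, non-semantic vector."""
    return [1.0 if ch == "#" else 0.0 for row in rows for ch in row]

# Precomputed once from visual structure; never updated by gradient
# descent -- only the downstream Transformer layers would be trainable.
EMBEDDING_TABLE = {c: glyph_to_vector(g) for c, g in GLYPHS.items()}

def embed(text):
    """Look up frozen visual embeddings for each character token."""
    return [EMBEDDING_TABLE[c] for c in text]

vecs = embed("AB")
print(len(vecs), len(vecs[0]))  # 2 tokens, each a 25-dim vector
```

The point of the sketch is that the table carries only structural (visual) information about each symbol; under the paper's thesis, any semantics must then emerge in the composition layers above it.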
PDF · July 11, 2025