The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models

Abstract

This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. The results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, the spatial analysis reveals unexpected semantic reactivation, in which early-layer features re-emerge at later layers, challenging standard assumptions about representational dynamics in transformer models.
