Transformers converge to invariant algorithmic cores
February 26, 2026
Author: Joshua S. Schiffman
cs.AI
Abstract
Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces that are necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed three-dimensional cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at the onset of grokking; these operators later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation, across model scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants, the computational essence of the model, rather than implementation-specific details.
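
The abstract does not spell out how a core is extracted or certified, so the following is a minimal sketch of one plausible formalization: assume a core is a low-rank subspace of hidden activations, test necessity and sufficiency by projection ablation, and compare cores across training runs via principal angles. All names here (core_basis, keep_core, ablate_core, core_overlap) are hypothetical illustrations, not the paper's actual procedure.

    import numpy as np

    def core_basis(acts: np.ndarray, k: int) -> np.ndarray:
        """Candidate algorithmic core: top-k principal directions of the
        hidden states collected while the model performs the task.
        acts: (n_samples, d_model); returns an orthonormal (d_model, k) basis."""
        centered = acts - acts.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt[:k].T

    def keep_core(acts: np.ndarray, U: np.ndarray) -> np.ndarray:
        """Sufficiency test: keep only the component of each activation
        inside span(U)."""
        return acts @ U @ U.T

    def ablate_core(acts: np.ndarray, U: np.ndarray) -> np.ndarray:
        """Necessity test: remove the span(U) component."""
        return acts - acts @ U @ U.T

    def core_overlap(U1: np.ndarray, U2: np.ndarray) -> float:
        """Compare cores from two training runs: mean squared cosine of the
        principal angles between the two subspaces (1.0 = identical span,
        near 0 = orthogonal)."""
        s = np.linalg.svd(U1.T @ U2, compute_uv=False)
        return float(np.mean(s ** 2))

Under this reading, a run exhibits a k-dimensional core when task performance survives keep_core but collapses under ablate_core, and the Markov-chain result would correspond to bases with low core_overlap across seeds that nonetheless induce the same transition spectrum.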
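
The GPT-2 claim, that flipping a single axis inverts grammatical number throughout generation, reads most naturally as a reflection of hidden states across the hyperplane orthogonal to a unit agreement axis. A sketch under that assumption follows; flip_axis and its placement (e.g. a forward hook on a residual stream at each decoding step) are illustrative, not the paper's stated method.

    import numpy as np

    def flip_axis(hidden: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Householder reflection across the hyperplane orthogonal to the
        unit axis v: h -> h - 2(h.v)v. This negates the model's reading of
        the axis while leaving the orthogonal complement untouched.
        hidden: (n_tokens, d_model); v: (d_model,)."""
        v = v / np.linalg.norm(v)
        return hidden - 2.0 * (hidden @ v)[..., None] * v

If the axis is the sole carrier of grammatical number, applying this reflection during decoding would make singular subjects receive plural verbs and vice versa, which is the behavior the abstract reports across model scales.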