Transformers converge to invariant algorithmic cores
February 26, 2026
Author: Joshua S. Schiffman
cs.AI
Abstract
Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces that are necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed three-dimensional cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at the onset of grokking; these operators later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation, across model scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants, the computational essence of the model, rather than implementation-specific details.
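
The abstract does not spell out how a core is extracted or certified, so the following is a minimal sketch of one plausible formalization: assume a core is a low-rank subspace of hidden activations, test necessity and sufficiency by projection ablation, and compare cores across training runs via principal angles. All names here (core_basis, keep_core, ablate_core, core_overlap) are hypothetical illustrations, not the paper's actual procedure.

    import numpy as np

    def core_basis(acts: np.ndarray, k: int) -> np.ndarray:
        """Candidate algorithmic core: top-k principal directions of the
        hidden states collected while the model performs the task.
        acts: (n_samples, d_model); returns an orthonormal (d_model, k) basis."""
        centered = acts - acts.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt[:k].T

    def keep_core(acts: np.ndarray, U: np.ndarray) -> np.ndarray:
        """Sufficiency test: keep only the component of each activation
        inside span(U)."""
        return acts @ U @ U.T

    def ablate_core(acts: np.ndarray, U: np.ndarray) -> np.ndarray:
        """Necessity test: remove the span(U) component."""
        return acts - acts @ U @ U.T

    def core_overlap(U1: np.ndarray, U2: np.ndarray) -> float:
        """Compare cores from two training runs: mean squared cosine of the
        principal angles between the two subspaces (1.0 = identical span,
        near 0 = orthogonal)."""
        s = np.linalg.svd(U1.T @ U2, compute_uv=False)
        return float(np.mean(s ** 2))

Under this reading, a run exhibits a k-dimensional core when task performance survives keep_core but collapses under ablate_core, and the Markov-chain result would correspond to bases with low core_overlap across seeds that nonetheless induce the same transition spectrum.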
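
The GPT-2 claim, that flipping a single axis inverts grammatical number throughout generation, reads most naturally as a reflection of hidden states across the hyperplane orthogonal to a unit agreement axis. A sketch under that assumption follows; flip_axis and its placement (e.g. a forward hook on a residual stream at each decoding step) are illustrative, not the paper's stated method.

    import numpy as np

    def flip_axis(hidden: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Householder reflection across the hyperplane orthogonal to the
        unit axis v: h -> h - 2(h.v)v. This negates the model's reading of
        the axis while leaving the orthogonal complement untouched.
        hidden: (n_tokens, d_model); v: (d_model,)."""
        v = v / np.linalg.norm(v)
        return hidden - 2.0 * (hidden @ v)[..., None] * v

If the axis is the sole carrier of grammatical number, applying this reflection during decoding would make singular subjects receive plural verbs and vice versa, which is the behavior the abstract reports across model scales.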