Transformers converge to invariant algorithmic cores
February 26, 2026
Author: Joshua S. Schiffman
cs.AI
Abstract
Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.