Transformers converge to invariant algorithmic cores
February 26, 2026
Author: Joshua S. Schiffman
cs.AI
Abstract
Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.