Transformers converge to invariant algorithmic cores

Abstract

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces that are necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. Across model scales, GPT-2 language models govern subject-verb agreement through a single axis; flipping that axis inverts grammatical number throughout generation. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants (the computational essence) rather than implementation-specific details.
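The claim that differently embedded cores can share one transition spectrum rests on a standard linear-algebra fact: eigenvalues are unchanged by a change of basis. The following is a minimal sketch of that fact only, not the extraction procedure used in this work. The transition matrix `P`, the ambient dimension `d`, and the helpers `embed` and `extract_core` are hypothetical names and values introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state transition matrix (rows sum to 1); not from the paper.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def embed(P, d, rng):
    """Place the 3x3 operator P inside a random 3-D subspace of R^d."""
    # Reduced QR of a random d x 3 matrix gives an orthonormal basis Q.
    Q, _ = np.linalg.qr(rng.standard_normal((d, P.shape[0])))
    return Q, Q @ P @ Q.T  # d x d operator that acts as P on span(Q)

def extract_core(M, k):
    """Restrict M to the span of its top-k left singular vectors."""
    U, _, _ = np.linalg.svd(M)
    U_k = U[:, :k]
    return U_k.T @ M @ U_k  # k x k operator, similar to P when rank(M) == k

d = 256  # assumed ambient width, chosen only for illustration
Q_a, M_a = embed(P, d, rng)  # stand-in for "model A"
Q_b, M_b = embed(P, d, rng)  # stand-in for "model B"

# Largest principal-angle cosine between the two subspaces (0 = orthogonal).
overlap = np.linalg.svd(Q_a.T @ Q_b, compute_uv=False).max()

spec_a = np.sort(np.linalg.eigvals(extract_core(M_a, 3)).real)
spec_b = np.sort(np.linalg.eigvals(extract_core(M_b, 3)).real)
spec_ref = np.sort(np.linalg.eigvals(P).real)

print("subspace overlap:", overlap)
print("spectrum A:", spec_a)
print("spectrum B:", spec_b)
```

In this toy setting the two embedded operators differ entry by entry and occupy nearly orthogonal subspaces, yet the restricted 3x3 operators have the same eigenvalues as `P`. Real trained networks are nonlinear and noisy, so this shows only why a spectrum is a natural basis-independent quantity to compare across runs.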
