트랜스포머는 불변 알고리즘 코어로 수렴한다

초록

대규모 언어 모델은 정교한 능력을 보여주지만, 그 내부 작동 방식을 이해하는 것은 여전히 핵심적인 과제로 남아 있습니다. 근본적인 장애물은 훈련이 회로가 아닌 행동을 선택하기 때문에 동일한 기능을 구현하는 다양한 가중치 구성이 가능하다는 점입니다. 어떤 내부 구조가 계산을 반영하고, 어떤 구조가 특정 훈련 과정의 부수적 결과일까요? 본 연구는 알고리즘 핵심(algorithmic core), 즉 과제 수행에 필요하고 충분한 컴팩트한 부분 공간을 추출합니다. 독립적으로 훈련된 트랜스포머는 서로 다른 가중치를 학습하지만 동일한 핵심으로 수렴합니다. 마르코프 체인 트랜스포머는 거의 직교하는 부분 공간에 3차원 핵심을 내장하지만 동일한 전이 스펙트럼을 복원합니다. 모듈러 덧셈(modular-addition) 트랜스포머는 그로킹(grokking) 시점에 컴팩트한 순환 연산자를 발견하고, 이후 팽창하여 암기에서 일반화로의 전환을 예측하는 모델을 제공합니다. GPT-2 언어 모델은 주어-동사 일치를 단일 축으로 제어하며, 이 축을 반전시킬 경우 규모에 관계없이 생성 과정 전체에서 문법적 수(number)가 뒤바뀝니다. 이러한 결과는 훈련 과정과 규모를 아우르며 지속되는 저차원 불변량을 보여주며, 트랜스포머 계산이 컴팩트하고 공유된 알고리즘 구조를 중심으로 조직되어 있음을 시사합니다. 구현체별 세부사항보다는 이러한 계산적 본질인 불변량을 목표로 삼는 것이 기계론적 해석성(mechanistic interpretability) 연구에 도움이 될 수 있습니다.

English

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.

트랜스포머는 불변 알고리즘 코어로 수렴한다

Transformers converge to invariant algorithmic cores

초록

Support