삼상 변압기

초록

본 논문에서는 SwiGLU + RMSNorm + RoPE + GQA를 기본 백본으로 하는 디코더 전용 Transformer에 대한 잔여 스트림 구조적 사전(prior)인 3상 변환기(Three-Phase Transformer, 3PT)를 제안합니다. 은닉 벡터는 N개의 동일한 크기를 가진 순환 채널(cyclic channels)로 분할되며, 각 채널은 위상 존중 연산(phase-respecting ops)으로 유지됩니다. 이러한 연산에는 채널별 RMSNorm, 어텐션과 FFN 사이에 각 채널을 theta + i*(2*pi/N)만큼 회전시키는 2D 기븐스 회전(Givens rotation), 그리고 GQA 헤드 수를 분할 구조에 맞추는 헤드 수 제약(head-count constraint)이 포함됩니다. 이 아키텍처는 부가적인 모듈이 아닌, 스크램블링(scrambling)과 재부과(re-imposition) 사이의 자가 안정화 균형(self-stabilizing equilibrium) 상태입니다. 분할 구조는 채널들에 직교하는 1차원 DC 부분 공간을 형성하며, 여기에 RoPE의 상대 위치 회전과 직교적으로 결합되는 절대 위치 사이드 채널(absolute-position side-channel)로서 고정된 가브리엘의 나팔 프로파일(Gabriel's horn profile) r(p) = 1/(p+1)을 주입합니다. 표준 설정인 N=3은 균형 3상 교류(AC)에서 비유를 차용하였는데, 여기서 120도 차이가 나는 세 개의 사인파는 상호 반비례 쌍 없이 합이 0이 됩니다. WikiText-103 기준 123M 매개변수 규모에서, 3PT는 매칭된 RoPE-Only 기준 모델 대비 +1,536개의 매개변수(전체의 0.00124%)만 추가하여 퍼플렉서티 -7.20%(바이트당 비트 수 -2.62%)를 달성했으며, 수렴 속도는 스텝 수 기준 1.93배(벽시계 시간 기준 1.64배) 빨라졌습니다. N은 고유한 최적점이라기보다 매개변수 공유 정도를 조절하는 손잡이(parameter-sharing knob) 역할을 합니다: 5.5M 규모에서는 {1,2,3,4,6,8,12}에 대한 N 스윕에서 N=1이 승리하며 거의 단조로운 양상을 보였으나, 123M 규모에서 3개의 시드에 대한 스윕 결과 N=3과 N=1은 통계적으로 구분되지 않았습니다. 핵심 작동 메커니즘은 채널 분할 잔여 스트림, 블록별 회전, 위상별 정규화, 그리고 나팔형 DC 주입입니다. 우리는 (a) 명시적 강제 없이 기하학적 구조가 자가 안정화되는 현상(신경망 보존 법칙 프레임워크의 새로운 사례), (b) 12개 레이어에서 관측된 회전각 드리프트(rotation-angle drift)의 U자형 깊이 프로파일, (c) RoPE, 어텐션, FFN과의 직교적 결합 특성을 분석하여 설명합니다.

English

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.