
Three-Phase Transformer

April 15, 2026
作者: Mohammad R. Abu Ayyash
cs.AI

Abstract

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting operations: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates channel i by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel that composes orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with a 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanisms are the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
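The two operations the abstract names in closed form — the per-channel Givens rotation by theta + i*(2*pi/N) and the Gabriel's horn profile r(p) = 1/(p+1) — can be sketched in a few lines. The sketch below is a minimal NumPy illustration under stated assumptions: the channel layout (contiguous channels, with the last hidden dimension reserved as the DC subspace) and the pairing of dimensions for the 2D rotation are choices made here for concreteness, not details given by the abstract.

```python
import numpy as np

def three_phase_rotate(x, n_channels=3, theta=0.1):
    """Rotate dimension pairs inside each cyclic channel by theta + i*(2*pi/N).

    x: (..., d) hidden states. Assumed layout (not specified by the paper):
    N contiguous channels of equal width, with the final dimension set
    aside as the 1D DC subspace that the rotation leaves untouched.
    """
    d = x.shape[-1]
    c = (d - 1) // n_channels          # channel width; 1 dim reserved as DC
    out = x.copy()
    for i in range(n_channels):
        angle = theta + i * 2 * np.pi / n_channels
        cos, sin = np.cos(angle), np.sin(angle)
        start = i * c
        # 2D Givens rotation applied to consecutive (even, odd) index pairs
        a = x[..., start:start + c:2]
        b = x[..., start + 1:start + c:2]
        out[..., start:start + c:2] = cos * a - sin * b
        out[..., start + 1:start + c:2] = sin * a + cos * b
    return out

def horn_dc(positions):
    """Gabriel's horn absolute-position profile r(p) = 1/(p+1),
    injected into the DC subspace as an absolute-position side-channel."""
    return 1.0 / (np.asarray(positions, dtype=float) + 1.0)
```

Because each channel is transformed by a pure rotation, the per-channel norms are preserved, which is consistent with the abstract's framing of the geometry as self-stabilizing rather than enforced by an explicit constraint.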