Three-Phase Transformer

April 15, 2026
Author: Mohammad R. Abu Ayyash
cs.AI

Abstract

We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
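
The abstract names three concrete operations: the per-channel phase angle theta + i*(2*pi/N), the per-channel RMSNorm, and the horn profile r(p) = 1/(p+1). The NumPy sketch below illustrates them under stated assumptions and is not the author's implementation. The function names are hypothetical. Pairing adjacent dimensions inside a channel to form the 2D rotation planes is an assumption, as is the gain-free RMSNorm. How the horn profile is written into the DC subspace is not specified in the abstract, so the sketch only computes the profile values.

```python
import numpy as np


def phase_angles(theta: float, n_channels: int) -> np.ndarray:
    """Angle for channel i: theta + i * (2*pi / N), as stated in the abstract."""
    return theta + np.arange(n_channels) * (2.0 * np.pi / n_channels)


def rotate_channels(h: np.ndarray, theta: float, n_channels: int) -> np.ndarray:
    """Rotate each channel of h (shape [..., d]) by its own phase angle.

    Assumption: inside a channel, adjacent dimensions (0,1), (2,3), ...
    form the 2D planes that the Givens rotation acts on.
    """
    d = h.shape[-1]
    if d % (2 * n_channels) != 0:
        raise ValueError("d must be divisible by 2 * n_channels")
    angles = phase_angles(theta, n_channels)
    # View h as [..., channel, pair, 2].
    x = h.reshape(*h.shape[:-1], n_channels, d // (2 * n_channels), 2)
    cos = np.cos(angles)[:, None]  # broadcasts over the pair axis
    sin = np.sin(angles)[:, None]
    a, b = x[..., 0], x[..., 1]
    out = np.stack([a * cos - b * sin, a * sin + b * cos], axis=-1)
    return out.reshape(h.shape)


def per_channel_rmsnorm(h: np.ndarray, n_channels: int, eps: float = 1e-6) -> np.ndarray:
    """Normalize each channel to unit RMS (assumption: no learnable gain)."""
    x = h.reshape(*h.shape[:-1], n_channels, -1)
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms).reshape(h.shape)


def horn_profile(seq_len: int) -> np.ndarray:
    """Gabriel's horn profile r(p) = 1 / (p + 1) for positions p = 0..seq_len-1."""
    return 1.0 / (np.arange(seq_len) + 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(2, 5, 12))  # [batch, position, hidden]
    rotated = rotate_channels(h, theta=0.3, n_channels=3)
    # A rotation leaves every vector norm unchanged.
    print(np.allclose(np.linalg.norm(rotated, axis=-1), np.linalg.norm(h, axis=-1)))
    # Balanced three-phase check: three cosines 120 degrees apart sum to zero.
    t = np.linspace(0.0, 2.0 * np.pi, 50)
    total = sum(np.cos(t + a) for a in phase_angles(0.0, 3))
    print(np.allclose(total, 0.0))
```

Because each pair of dimensions is rotated by an orthogonal 2x2 matrix, the rotation leaves both the per-channel and the whole-vector norm unchanged, so it can be applied either before or after the per-channel normalization without rescaling the result.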