Three-Phase Transformer

Abstract

We present the Three-Phase Transformer (3PT), a structural prior on the residual stream of decoder-only Transformers, built on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally sized cyclic channels, each maintained by phase-respecting operations: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates channel i by theta + i*(2*pi/N), and a head-count constraint that aligns GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side channel that composes orthogonally with RoPE's relative-position rotation. The canonical N=3 configuration borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT reduces perplexity by 7.20% (bits-per-byte by 2.62%) relative to a matched RoPE-only baseline, at a cost of 1,536 additional parameters (0.00124% of the total), and converges 1.93x faster in step count (1.64x in wall-clock time). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M parameters, a sweep of N over {1,2,3,4,6,8,12} is near-monotone, with N=1 performing best; at 123M parameters, a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the combination of the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift across 12 layers; and (c) orthogonal composition with RoPE, attention, and FFN.
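
The channel operations and the three-phase metaphor described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. All function names are hypothetical, and several details are assumptions the abstract does not specify: the Givens rotation is applied to consecutive coordinate pairs within each channel, the RMSNorm omits its learned gain, and normalization is applied before rotation. The DC injection itself is not modeled; only the horn profile r(p) = 1/(p+1) is computed.

```python
import math


def phase_angles(theta, n_channels):
    """Rotation angle for each channel i: theta + i * (2*pi / N)."""
    step = 2.0 * math.pi / n_channels
    return [theta + i * step for i in range(n_channels)]


def rms_norm(x, eps=1e-6):
    """RMS normalization of one channel (assumption: no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def givens_rotate(x, angle):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by `angle`.

    The pairing scheme is an assumption made for illustration.
    """
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for j in range(0, len(x), 2):
        a, b = x[j], x[j + 1]
        out.extend([c * a - s * b, s * a + c * b])
    return out


def three_phase_step(hidden, theta, n_channels):
    """Split `hidden` into N equal channels, then normalize and rotate each."""
    size = len(hidden) // n_channels
    if size * n_channels != len(hidden) or size % 2 != 0:
        raise ValueError("hidden size must split into N even-sized channels")
    out = []
    for i, angle in enumerate(phase_angles(theta, n_channels)):
        channel = hidden[i * size:(i + 1) * size]
        out.extend(givens_rotate(rms_norm(channel), angle))
    return out


def horn_profile(position):
    """Gabriel's horn profile r(p) = 1 / (p + 1) for a 0-based position p."""
    return 1.0 / (position + 1)


def phase_sum(t, n_channels):
    """Sum of N unit sinusoids spaced 2*pi/N apart, evaluated at phase t."""
    return sum(math.cos(a) for a in phase_angles(t, n_channels))
```

For N=3, `phase_sum(t, 3)` vanishes up to floating-point error for any `t`, and any two of the three phases differ by 120 degrees, so their correlation is cos(120°) = -0.5 rather than -1. With N=2 the two phases are exactly anti-correlated, which is the case the three-phase metaphor avoids. Because a Givens rotation preserves the norm of each pair, each channel keeps the RMS set by the normalization.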