Linearizing Vision Transformer with Test-Time Training

Abstract

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T⁵ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4 × H20 GPUs, SD3.5-T⁵ achieves text-to-image quality comparable to that of the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions, respectively. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
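
The key shift-invariance property mentioned above can be illustrated with a small, self-contained sketch. It is not taken from the released code: the function names are hypothetical, the "linear attention" is a plain unnormalized stand-in rather than the TTT layer, and "key instance normalization" is assumed here to mean normalizing each key channel to zero mean and unit variance across tokens. Under those assumptions, adding the same offset to every key leaves the Softmax attention output unchanged, changes the linear attention output, and stops mattering once the keys are normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_attention(q, keys, values):
    """Single-query Softmax attention over lists of key/value vectors."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def linear_attention(q, keys, values):
    """Unnormalized linear attention (stand-in): sum_j (q . k_j) v_j."""
    scores = [dot(q, k) for k in keys]
    dim = len(values[0])
    return [sum(s * v[d] for s, v in zip(scores, values)) for d in range(dim)]

def instance_norm_keys(keys, eps=1e-6):
    """Assumed form of key instance normalization: per-channel
    zero mean and unit variance, computed across tokens."""
    n, dim = len(keys), len(keys[0])
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        col = [k[d] for k in keys]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        for j in range(n):
            out[j][d] = (col[j] - mean) / math.sqrt(var + eps)
    return out

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Toy data: one query, three tokens, two channels.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
values = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]

# Shift every key by the same offset vector.
offset = [3.0, -2.0]
shifted = [[k + o for k, o in zip(key, offset)] for key in keys]

softmax_gap = max_abs_diff(softmax_attention(q, keys, values),
                           softmax_attention(q, shifted, values))
linear_gap = max_abs_diff(linear_attention(q, keys, values),
                          linear_attention(q, shifted, values))
normed_gap = max_abs_diff(
    linear_attention(q, instance_norm_keys(keys), values),
    linear_attention(q, instance_norm_keys(shifted), values))

print("Softmax attention gap:", softmax_gap)
print("Linear attention gap:", linear_gap)
print("Linear attention gap after key normalization:", normed_gap)
```

In Softmax attention the shift adds the same constant to every score, which cancels in the normalization. A linear form has no such cancellation, so removing the per-channel mean of the keys is one way to recover the invariance.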