Linearizing the Vision Transformer via Test-Time Training

Abstract

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T^5 (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4×H20 GPUs, SD3.5-T^5 achieves text-to-image quality comparable to the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions, respectively. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
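
The key shift-invariance property mentioned in the abstract can be illustrated with a small sketch. Adding the same vector c to every key changes each Softmax score by the constant q·c, which the Softmax normalization cancels, so the output is unchanged. Plain linear attention has no such cancellation. The code below is not the paper's TTT layer: it uses un-normalized linear attention as a stand-in, and `key_instance_norm` is a hypothetical helper that assumes key instance normalization means standardizing each key channel over the token axis. The abstract does not specify the exact form, so treat this as an assumption.

```python
import math


def softmax_attention(q, keys, values):
    """Softmax attention output for a single query (quadratic overall in token count)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim_v)]


def linear_attention(q, keys, values):
    """Un-normalized linear attention q (K^T V); K^T V is built once, linear in token count."""
    dim_k, dim_v = len(q), len(values[0])
    kv = [[sum(k[i] * v[j] for k, v in zip(keys, values)) for j in range(dim_v)]
          for i in range(dim_k)]
    return [sum(q[i] * kv[i][j] for i in range(dim_k)) for j in range(dim_v)]


def key_instance_norm(keys, eps=1e-6):
    """Assumed form: zero mean and unit variance per key channel, taken over tokens."""
    n, dim_k = len(keys), len(keys[0])
    means = [sum(k[i] for k in keys) / n for i in range(dim_k)]
    variances = [sum((k[i] - means[i]) ** 2 for k in keys) / n for i in range(dim_k)]
    return [[(k[i] - means[i]) / math.sqrt(variances[i] + eps) for i in range(dim_k)]
            for k in keys]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Toy data: three tokens, two channels. All values are illustrative.
q = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
shift = [3.0, 3.0]
shifted = [[k[i] + shift[i] for i in range(len(k))] for k in keys]

# Softmax attention ignores the shift; plain linear attention does not.
print(max_abs_diff(softmax_attention(q, keys, values),
                   softmax_attention(q, shifted, values)))
print(max_abs_diff(linear_attention(q, keys, values),
                   linear_attention(q, shifted, values)))
# Normalizing keys over tokens removes the shift before it reaches the linear layer.
print(max_abs_diff(linear_attention(q, key_instance_norm(keys), values),
                   linear_attention(q, key_instance_norm(shifted), values)))
```

The first and third differences are zero up to floating-point error, while the second is large. Under the stated assumption, normalizing keys over tokens gives a linear-complexity layer the same invariance Softmax attention has by construction.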