Linearizing Vision Transformers via Test-Time Training

Abstract

Linear-complexity attention mechanisms offer a promising way to overcome the quadratic bottleneck of Softmax attention, but training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers is an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T^5 (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4×H20 GPUs, SD3.5-T^5 achieves text-to-image quality comparable to that of the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions, respectively. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
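The key shift-invariance property mentioned above can be illustrated numerically. The following is a minimal NumPy sketch, not the released implementation: the ReLU feature map in `linear_attention` and the per-channel normalization over tokens in `key_instance_norm` are assumptions chosen for illustration, and all function names are hypothetical. It shows that adding the same offset to every key leaves Softmax attention unchanged (the offset contributes a per-query constant that cancels in the softmax), that a plain linear attention layer is sensitive to the same offset, and that normalizing keys across tokens removes the offset before it can take effect.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention over n tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Kernelized linear attention; the ReLU feature map is an assumption.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v            # (d, d_v) summary, linear in the number of tokens
    z = kf.sum(axis=0)       # (d,) normalizer
    return (qf @ kv) / (qf @ z)[:, None]

def key_instance_norm(k, eps=1e-6):
    # Assumed form: normalize each key channel across the token axis.
    mean = k.mean(axis=0, keepdims=True)
    std = k.std(axis=0, keepdims=True)
    return (k - mean) / (std + eps)

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.standard_normal((3, n, d))
shift = 3.0 * rng.standard_normal((1, d))  # same offset added to every key

softmax_gap = np.abs(
    softmax_attention(q, k, v) - softmax_attention(q, k + shift, v)
).max()
linear_gap = np.abs(
    linear_attention(q, k, v) - linear_attention(q, k + shift, v)
).max()
normed_gap = np.abs(
    linear_attention(q, key_instance_norm(k), v)
    - linear_attention(q, key_instance_norm(k + shift), v)
).max()

print("softmax attention, shifted keys:       ", softmax_gap)
print("linear attention, shifted keys:        ", linear_gap)
print("linear attention + key instance norm:  ", normed_gap)
```

The sketch covers only the shift-invariance argument; it does not model the TTT layer or the locality enhancement module.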