Linearizing Vision Transformers with Test-Time Training

Abstract

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic complexity bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T⁵ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4×H20 GPUs, SD3.5-T⁵ achieves text-to-image quality comparable to that of the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions, respectively. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
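
The key shift-invariance property mentioned above can be illustrated with a small numerical sketch. It rests on assumptions, since the abstract does not give the exact formulation: `key_instance_norm` is a hypothetical helper that normalizes each key channel with statistics taken over the token axis, and `linear_attention` is a generic unnormalized linear attention used as a stand-in, not the TTT layer itself. Under these assumptions, adding the same offset to every key leaves Softmax attention unchanged and changes the plain linear variant, while the normalized keys come out identical with or without the offset.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def linear_attention(q, k, v):
    # Generic unnormalized linear attention (illustrative stand-in, not TTT).
    return q @ (k.T @ v)

def key_instance_norm(k, eps=1e-6):
    # Assumed form: per-channel mean and variance over the token axis.
    mean = k.mean(axis=0, keepdims=True)
    var = k.var(axis=0, keepdims=True)
    return (k - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
shift = rng.standard_normal((1, dim))  # same offset added to every key

# Softmax attention: the shift adds a per-query constant to all scores,
# which cancels in the softmax.
softmax_gap = np.abs(
    softmax_attention(q, k, v) - softmax_attention(q, k + shift, v)
).max()

# Unnormalized linear attention: the shift changes the output.
linear_gap = np.abs(
    linear_attention(q, k, v) - linear_attention(q, k + shift, v)
).max()

# Normalizing keys over tokens removes the shift before attention is applied.
norm_gap = np.abs(key_instance_norm(k) - key_instance_norm(k + shift)).max()

print(f"softmax gap: {softmax_gap:.2e}")
print(f"linear gap:  {linear_gap:.2e}")
print(f"normed-key gap: {norm_gap:.2e}")
```

Subtracting the per-channel mean is what cancels the shift; the division by the standard deviation is included only because it is part of standard instance normalization.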