ELT: Elastic Looped Transformers for Visual Generation
April 10, 2026
Authors: Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati
cs.AI
Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter count while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self-Distillation (ILSD), in which student configurations (intermediate loop counts) are distilled from the teacher configuration (the maximum training loop count), ensuring consistency across the model's depth within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference with a dynamic trade-off between computational cost and generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis: with a 4× reduction in parameter count under iso-inference-compute settings, it achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.
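To make the two ideas in the abstract concrete, the sketch below shows (a) a single weight-shared transformer block applied for a variable number of loops, and (b) an ILSD-style training objective in which an intermediate-loop "student" pass is pulled toward the maximum-loop "teacher" pass within the same step. This is a minimal illustration of the general idea only, not the authors' implementation: the module names (LoopedTransformer, ilsd_loss), the hyperparameters, the MSE distillation target, and the placeholder task loss are all assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a looped, weight-shared
# transformer and an ILSD-style self-distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoopedTransformer(nn.Module):
    """One transformer block reused across loops (weights shared over depth)."""

    def __init__(self, dim: int = 512, heads: int = 8, max_loops: int = 12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.max_loops = max_loops
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, num_loops: int | None = None) -> torch.Tensor:
        # Elastic depth: any loop count up to max_loops can be chosen at inference.
        num_loops = num_loops or self.max_loops
        for _ in range(num_loops):
            x = self.block(x)  # the same weights are applied on every iteration
        return self.out_norm(x)


def ilsd_loss(model: LoopedTransformer, x: torch.Tensor, task_loss_fn) -> torch.Tensor:
    """ILSD-style objective (sketch): the full-depth pass acts as the teacher,
    a randomly sampled intermediate loop count acts as the student, and the
    student output is matched to the detached teacher output in the same step."""
    teacher_out = model(x, num_loops=model.max_loops)
    student_loops = int(torch.randint(1, model.max_loops, (1,)))
    student_out = model(x, num_loops=student_loops)

    distill = F.mse_loss(student_out, teacher_out.detach())
    return task_loss_fn(teacher_out) + distill


if __name__ == "__main__":
    model = LoopedTransformer(dim=512, heads=8, max_loops=12)
    tokens = torch.randn(2, 64, 512)  # (batch, image/video token sequence, dim)
    # Dummy task loss standing in for the actual generative objective.
    loss = ilsd_loss(model, tokens, task_loss_fn=lambda y: y.pow(2).mean())
    loss.backward()
    print(float(loss))
```

In this reading, "elastic" inference simply means calling the trained model with fewer loops when compute is constrained, since every loop count was kept consistent with the full-depth teacher during training.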