ELT: Elastic Looped Transformers for Visual Generation
April 10, 2026
Authors: Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati
cs.AI
Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter count while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self-Distillation (ILSD), in which student configurations (intermediate loop counts) are distilled from the teacher configuration (the maximum training loop count), ensuring consistency across the model's depth within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference with a dynamic trade-off between computational cost and generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis: with a 4× reduction in parameter count under iso-inference-compute settings, it achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.
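To make the two ideas in the abstract concrete, the sketch below shows (a) a single weight-shared transformer block applied for a variable number of loops, and (b) an ILSD-style training objective in which an intermediate-loop "student" pass is pulled toward the maximum-loop "teacher" pass within the same step. This is a minimal illustration of the general idea only, not the authors' implementation: the module names (LoopedTransformer, ilsd_loss), the hyperparameters, the MSE distillation target, and the placeholder task loss are all assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a looped, weight-shared
# transformer and an ILSD-style self-distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoopedTransformer(nn.Module):
    """One transformer block reused across loops (weights shared over depth)."""

    def __init__(self, dim: int = 512, heads: int = 8, max_loops: int = 12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.max_loops = max_loops
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, num_loops: int | None = None) -> torch.Tensor:
        # Elastic depth: any loop count up to max_loops can be chosen at inference.
        num_loops = num_loops or self.max_loops
        for _ in range(num_loops):
            x = self.block(x)  # the same weights are applied on every iteration
        return self.out_norm(x)


def ilsd_loss(model: LoopedTransformer, x: torch.Tensor, task_loss_fn) -> torch.Tensor:
    """ILSD-style objective (sketch): the full-depth pass acts as the teacher,
    a randomly sampled intermediate loop count acts as the student, and the
    student output is matched to the detached teacher output in the same step."""
    teacher_out = model(x, num_loops=model.max_loops)
    student_loops = int(torch.randint(1, model.max_loops, (1,)))
    student_out = model(x, num_loops=student_loops)

    distill = F.mse_loss(student_out, teacher_out.detach())
    return task_loss_fn(teacher_out) + distill


if __name__ == "__main__":
    model = LoopedTransformer(dim=512, heads=8, max_loops=12)
    tokens = torch.randn(2, 64, 512)  # (batch, image/video token sequence, dim)
    # Dummy task loss standing in for the actual generative objective.
    loss = ilsd_loss(model, tokens, task_loss_fn=lambda y: y.pow(2).mean())
    loss.backward()
    print(float(loss))
```

In this reading, "elastic" inference simply means calling the trained model with fewer loops when compute is constrained, since every loop count was kept consistent with the full-depth teacher during training.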