ELT: Elastic Looped Transformers for Visual Generation
April 10, 2026
Authors: Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati
cs.AI
Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach applies iterative, weight-shared transformer blocks to drastically reduce parameter count while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self Distillation (ILSD), in which student configurations (intermediate loop counts) are distilled from the teacher configuration (the maximum training loop count) to ensure consistency across the model's depths within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference that dynamically trades computational cost against generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis: with a 4× reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.
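To make the two ideas in the abstract concrete, below is a minimal sketch of a looped transformer with an elastic loop count and an ILSD-style training step. This is an illustrative assumption of the setup, not the authors' implementation: the use of `nn.TransformerEncoderLayer`, the MSE losses, the distillation weight `alpha`, and all dimensions are placeholders chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoopedTransformer(nn.Module):
    """One weight-shared transformer block applied n_loops times,
    standing in for a deep stack of unique layers."""

    def __init__(self, dim: int, n_heads: int, max_loops: int):
        super().__init__()
        # A single shared block; depth comes from iteration, not parameters.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True
        )
        self.max_loops = max_loops
        self.head = nn.Linear(dim, dim)  # illustrative output projection

    def forward(self, x: torch.Tensor, n_loops: int) -> torch.Tensor:
        assert 1 <= n_loops <= self.max_loops
        for _ in range(n_loops):  # elastic depth via repeated application
            x = self.shared_block(x)
        return self.head(x)


def ilsd_step(model: LoopedTransformer, x, target, alpha: float = 0.5):
    """One hypothetical training step: the full-depth (teacher) output
    supervises a sampled intermediate-depth (student) output, so every
    loop count remains usable at inference time."""
    teacher_out = model(x, n_loops=model.max_loops)
    task_loss = F.mse_loss(teacher_out, target)

    # Sample a student depth and distill it toward the detached teacher.
    student_loops = int(torch.randint(1, model.max_loops, (1,)))
    student_out = model(x, n_loops=student_loops)
    distill_loss = F.mse_loss(student_out, teacher_out.detach())

    return task_loss + alpha * distill_loss


# After training, any loop count trades compute for quality with the
# same parameters ("any-time" inference).
model = LoopedTransformer(dim=256, n_heads=8, max_loops=8)
tokens = torch.randn(2, 16, 256)               # (batch, sequence, dim)
fast = model(tokens, n_loops=2)                # cheaper, lower quality
best = model(tokens, n_loops=model.max_loops)  # full-depth output
```

The key design point the sketch mirrors is that distilling intermediate loop counts from the maximum-loop teacher within the same training step is what makes a single run yield the whole elastic family, rather than training one model per depth.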