ELT: Elastic Looped Transformers for Visual Generation
April 10, 2026
Authors: Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati
cs.AI
Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach applies iterative, weight-shared transformer blocks to drastically reduce parameter count while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self Distillation (ILSD), in which student configurations (intermediate loop counts) are distilled from the teacher configuration (the maximum training loop count) to ensure consistency across the model's depths within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference that dynamically trades computational cost against generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis: with a 4× reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.
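To make the two ideas in the abstract concrete, below is a minimal sketch of a looped transformer with an elastic loop count and an ILSD-style training step. This is an illustrative assumption of the setup, not the authors' implementation: the use of `nn.TransformerEncoderLayer`, the MSE losses, the distillation weight `alpha`, and all dimensions are placeholders chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoopedTransformer(nn.Module):
    """One weight-shared transformer block applied n_loops times,
    standing in for a deep stack of unique layers."""

    def __init__(self, dim: int, n_heads: int, max_loops: int):
        super().__init__()
        # A single shared block; depth comes from iteration, not parameters.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True
        )
        self.max_loops = max_loops
        self.head = nn.Linear(dim, dim)  # illustrative output projection

    def forward(self, x: torch.Tensor, n_loops: int) -> torch.Tensor:
        assert 1 <= n_loops <= self.max_loops
        for _ in range(n_loops):  # elastic depth via repeated application
            x = self.shared_block(x)
        return self.head(x)


def ilsd_step(model: LoopedTransformer, x, target, alpha: float = 0.5):
    """One hypothetical training step: the full-depth (teacher) output
    supervises a sampled intermediate-depth (student) output, so every
    loop count remains usable at inference time."""
    teacher_out = model(x, n_loops=model.max_loops)
    task_loss = F.mse_loss(teacher_out, target)

    # Sample a student depth and distill it toward the detached teacher.
    student_loops = int(torch.randint(1, model.max_loops, (1,)))
    student_out = model(x, n_loops=student_loops)
    distill_loss = F.mse_loss(student_out, teacher_out.detach())

    return task_loss + alpha * distill_loss


# After training, any loop count trades compute for quality with the
# same parameters ("any-time" inference).
model = LoopedTransformer(dim=256, n_heads=8, max_loops=8)
tokens = torch.randn(2, 16, 256)               # (batch, sequence, dim)
fast = model(tokens, n_loops=2)                # cheaper, lower quality
best = model(tokens, n_loops=model.max_loops)  # full-depth output
```

The key design point the sketch mirrors is that distilling intermediate loop counts from the maximum-loop teacher within the same training step is what makes a single run yield the whole elastic family, rather than training one model per depth.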