One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
March 12, 2026
Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
cs.AI
Abstract
Diffusion transformers (DiTs) achieve high generative quality but tie FLOPs to image resolution, precluding principled latency-quality trade-offs, and they allocate computation uniformly across spatial input tokens, wasting compute on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface: a learnable, variable-length token sequence on which the standard transformer blocks operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents, prioritizing important input regions. By training with random dropping of tail latents, ELIT learns importance-ordered representations: earlier latents capture global structure, while later ones carry the information needed to refine details. At inference, the number of latents can be adjusted dynamically to match compute constraints. ELIT is deliberately minimal, adding only two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K at 512px, it improves FID and FDD scores by an average of 35.3% and 39.6%, respectively. Project page: https://snap-research.github.io/elit/