SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
January 30, 2025
Authors: Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han
cs.AI
Abstract
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient
scaling in text-to-image generation. Building upon SANA-1.0, we introduce three
key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that
enables scaling from 1.6B to 4.8B parameters with significantly reduced
computational resources, combined with a memory-efficient 8-bit optimizer. (2)
Model Depth Pruning: A block importance analysis technique for efficient model
compression to arbitrary sizes with minimal quality loss. (3) Inference-time
Scaling: A repeated sampling strategy that trades computation for model
capacity, enabling smaller models to match the quality of larger models at
inference time. Through these strategies, SANA-1.5 achieves a text-image
alignment score of 0.72 on GenEval, which can be further improved to 0.80
through inference scaling, establishing a new SoTA on the GenEval benchmark.
These innovations enable efficient model scaling across different compute
budgets while maintaining high quality, making high-quality image generation
more accessible.
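The abstract does not spell out how depth growth avoids retraining from scratch, but the standard construction it suggests is to stack identity-initialized blocks onto the pretrained ones, so the grown model starts out computing exactly the same function as the small one. Below is a minimal PyTorch sketch under that assumption; `ToyBlock`, `grow_depth`, and the placement of the new blocks are illustrative stand-ins, not SANA's actual code.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy residual block; a stand-in for a linear-DiT block."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Linear(dim, dim)       # placeholder for linear attention + FFN
        self.out_proj = nn.Linear(dim, dim)  # last projection on the residual branch

    def forward(self, x):
        return x + self.out_proj(self.mix(self.norm(x)))

def grow_depth(pretrained, num_new, dim):
    """Return a deeper stack that initially computes the same function as
    `pretrained`: each new block's residual branch is zero-initialized, so
    the block starts as an identity map and training only learns the delta."""
    grown = list(pretrained)
    for _ in range(num_new):
        blk = ToyBlock(dim)
        nn.init.zeros_(blk.out_proj.weight)
        nn.init.zeros_(blk.out_proj.bias)
        grown.append(blk)  # where new blocks go is a design choice; appended here
    return nn.ModuleList(grown)

# Sanity check: the grown model matches the small one at initialization.
small = nn.ModuleList([ToyBlock(64) for _ in range(4)])
big = grow_depth(small, num_new=8, dim=64)
x = torch.randn(2, 16, 64)
y_small, y_big = x, x
for b in small:
    y_small = b(y_small)
for b in big:
    y_big = b(y_big)
assert torch.allclose(y_small, y_big)
```

Because each new block contributes zero to the residual stream at initialization, the continued training that takes the model from 1.6B to 4.8B parameters starts from the pretrained model's behavior rather than from random outputs.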
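For depth pruning, the abstract names a block importance analysis but not the metric. One common proxy, sketched here as an assumption, scores each residual block by how strongly it transforms its input on a small calibration batch and drops the lowest-scoring blocks; the cosine-similarity criterion below is hypothetical and not necessarily the paper's exact measure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def block_importance(blocks, x):
    """Score each residual block by how much it changes its input,
    using 1 - cosine similarity between block input and output. Blocks
    that barely transform their input are the first pruning candidates."""
    scores = []
    for blk in blocks:
        y = blk(x)
        sim = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
        scores.append(1.0 - sim.item())
        x = y
    return scores

def prune_to_depth(blocks, calib_x, keep):
    """Keep the `keep` highest-importance blocks, preserving their order."""
    scores = block_importance(blocks, calib_x)
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return torch.nn.ModuleList([blocks[i] for i in kept])
```

Inference-time scaling by repeated sampling is simpler to sketch: draw several candidates for the same prompt and keep the one a verifier prefers. `sample` and `judge` below are placeholder callables (e.g., the diffusion sampler and a VLM-based scorer), not SANA's APIs.

```python
def best_of_n(sample, judge, prompt, n=16):
    """Repeated-sampling inference scaling: generate n images for one
    prompt and return the candidate the verifier scores highest."""
    images = [sample(prompt) for _ in range(n)]
    return max(images, key=lambda img: judge(prompt, img))
```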