SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
January 30, 2025
Authors: Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han
cs.AI
Abstract
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient
scaling in text-to-image generation. Building upon SANA-1.0, we introduce three
key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that
enables scaling from 1.6B to 4.8B parameters with significantly reduced
computational resources, combined with a memory-efficient 8-bit optimizer. (2)
Model Depth Pruning: A block importance analysis technique for efficient model
compression to arbitrary sizes with minimal quality loss. (3) Inference-time
Scaling: A repeated sampling strategy that trades computation for model
capacity, enabling smaller models to match the quality of larger models at
inference time. Through these strategies, SANA-1.5 achieves a text-image
alignment score of 0.72 on GenEval, which can be further improved to 0.80
through inference scaling, establishing a new SoTA on the GenEval benchmark.
These innovations enable efficient model scaling across different compute
budgets while maintaining high quality, making high-quality image generation
more accessible.
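The abstract does not spell out how depth growth avoids retraining from scratch, but the standard construction it suggests is to stack identity-initialized blocks onto the pretrained ones, so the grown model starts out computing exactly the same function as the small one. Below is a minimal PyTorch sketch under that assumption; `ToyBlock`, `grow_depth`, and the placement of the new blocks are illustrative stand-ins, not SANA's actual code.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy residual block; a stand-in for a linear-DiT block."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Linear(dim, dim)       # placeholder for linear attention + FFN
        self.out_proj = nn.Linear(dim, dim)  # last projection on the residual branch

    def forward(self, x):
        return x + self.out_proj(self.mix(self.norm(x)))

def grow_depth(pretrained, num_new, dim):
    """Return a deeper stack that initially computes the same function as
    `pretrained`: each new block's residual branch is zero-initialized, so
    the block starts as an identity map and training only learns the delta."""
    grown = list(pretrained)
    for _ in range(num_new):
        blk = ToyBlock(dim)
        nn.init.zeros_(blk.out_proj.weight)
        nn.init.zeros_(blk.out_proj.bias)
        grown.append(blk)  # where new blocks go is a design choice; appended here
    return nn.ModuleList(grown)

# Sanity check: the grown model matches the small one at initialization.
small = nn.ModuleList([ToyBlock(64) for _ in range(4)])
big = grow_depth(small, num_new=8, dim=64)
x = torch.randn(2, 16, 64)
y_small, y_big = x, x
for b in small:
    y_small = b(y_small)
for b in big:
    y_big = b(y_big)
assert torch.allclose(y_small, y_big)
```

Because each new block contributes zero to the residual stream at initialization, the continued training that takes the model from 1.6B to 4.8B parameters starts from the pretrained model's behavior rather than from random outputs.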
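For depth pruning, the abstract names a block importance analysis but not the metric. One common proxy, sketched here as an assumption, scores each residual block by how strongly it transforms its input on a small calibration batch and drops the lowest-scoring blocks; the cosine-similarity criterion below is hypothetical and not necessarily the paper's exact measure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def block_importance(blocks, x):
    """Score each residual block by how much it changes its input,
    using 1 - cosine similarity between block input and output. Blocks
    that barely transform their input are the first pruning candidates."""
    scores = []
    for blk in blocks:
        y = blk(x)
        sim = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
        scores.append(1.0 - sim.item())
        x = y
    return scores

def prune_to_depth(blocks, calib_x, keep):
    """Keep the `keep` highest-importance blocks, preserving their order."""
    scores = block_importance(blocks, calib_x)
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return torch.nn.ModuleList([blocks[i] for i in kept])
```

Inference-time scaling by repeated sampling is simpler to sketch: draw several candidates for the same prompt and keep the one a verifier prefers. `sample` and `judge` below are placeholder callables (e.g., the diffusion sampler and a VLM-based scorer), not SANA's APIs.

```python
def best_of_n(sample, judge, prompt, n=16):
    """Repeated-sampling inference scaling: generate n images for one
    prompt and return the candidate the verifier scores highest."""
    images = [sample(prompt) for _ in range(n)]
    return max(images, key=lambda img: judge(prompt, img))
```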