
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

March 15, 2025
作者: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, Aditya Grover
cs.AI

Abstract

The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.
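To make the contrast concrete, the sketch below simulates the two inference-time strategies the abstract describes: passive best-of-N sampling versus a reflection loop that conditions each new generation on previously generated samples plus textual feedback. This is a minimal illustration under stated assumptions, not the paper's implementation; `generate`, `critique`, and the scoring scheme are hypothetical stand-ins for the diffusion transformer, the feedback model, and the selection model.

```python
import random

random.seed(0)

def generate(prompt, context=()):
    """Hypothetical stand-in for the diffusion transformer.

    Reflect-DiT conditions generation on in-context (image, feedback)
    pairs; here we mimic that targeted refinement by biasing a random
    quality score upward with the amount of accumulated context.
    """
    return min(1.0, random.random() + 0.1 * len(context))

def critique(score):
    """Hypothetical feedback model: returns textual feedback describing
    a needed improvement, or None once the sample is judged good enough."""
    return None if score >= 0.8 else "fix object count / attribute binding"

def best_of_n(prompt, n):
    # Passive strategy: draw n independent samples, let a selection
    # model keep the best one.
    return max(generate(prompt) for _ in range(n))

def reflect(prompt, budget):
    # Active strategy: maintain a context of prior samples and their
    # textual feedback, and condition every new generation on it.
    context, best = [], 0.0
    for _ in range(budget):
        score = generate(prompt, context)
        best = max(best, score)
        feedback = critique(score)
        if feedback is None:        # good enough: stop early
            break
        context.append((score, feedback))
    return best

print(best_of_n("a red cube left of a blue sphere", 20))
print(reflect("a red cube left of a blue sphere", 20))
```

The design point is that best-of-N spends its whole budget on independent draws, while the reflection loop reuses earlier failures as conditioning signal, which is how the paper reports matching a 2048-sample best-of-N baseline with only 20 samples per prompt.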
PDF, March 19, 2025