Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
March 15, 2025
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, Aditya Grover
cs.AI
Abstract
The predominant approach to advancing text-to-image generation has been
training-time scaling, where larger models are trained on more data using
greater computational resources. While effective, this approach is
computationally expensive, leading to growing interest in inference-time
scaling to improve performance. Currently, inference-time scaling for
text-to-image diffusion models is largely limited to best-of-N sampling, where
multiple images are generated per prompt and a selection model chooses the best
output. Inspired by the recent success of reasoning models like DeepSeek-R1 in
the language domain, we introduce an alternative to naive best-of-N sampling by
equipping text-to-image Diffusion Transformers with in-context reflection
capabilities. We propose Reflect-DiT, a method that enables Diffusion
Transformers to refine their generations using in-context examples of
previously generated images alongside textual feedback describing necessary
improvements. Instead of passively relying on random sampling and hoping for a
better result in a future generation, Reflect-DiT explicitly tailors its
generations to address specific aspects requiring enhancement. Experimental
results demonstrate that Reflect-DiT improves performance on the GenEval
benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it
achieves a new state-of-the-art score of 0.81 on GenEval while generating only
20 samples per prompt, surpassing the previous best score of 0.80, which was
obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples
under the best-of-N approach.
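
To make the contrast with naive best-of-N sampling concrete, below is a minimal sketch of an inference-time reflection loop in the spirit described above. The names used here (`ReflectionContext`, `reflect_dit_sample`, `model.generate`, `verifier.score`, `critic.describe_issues`) are hypothetical placeholders rather than the authors' actual interfaces; the sketch only illustrates the idea of conditioning each new generation on previously generated images and textual feedback.

```python
# Hypothetical sketch of an inference-time reflection loop in the spirit of
# Reflect-DiT. All interfaces (model, verifier, critic) are illustrative
# placeholders, not the paper's actual implementation.

from dataclasses import dataclass, field


@dataclass
class ReflectionContext:
    """In-context examples: prior generations and textual feedback on them."""
    images: list = field(default_factory=list)    # previously generated images
    feedback: list = field(default_factory=list)  # critiques describing needed improvements


def reflect_dit_sample(model, verifier, critic, prompt,
                       max_samples=20, threshold=0.9):
    """Iteratively refine generations instead of drawing independent best-of-N samples.

    model    -- a diffusion transformer that can condition on the prompt plus
                in-context images and textual feedback (placeholder interface)
    verifier -- scores an image for prompt alignment (placeholder interface)
    critic   -- produces textual feedback on what to improve (placeholder interface)
    """
    context = ReflectionContext()
    best_image, best_score = None, float("-inf")

    for _ in range(max_samples):
        # Condition generation on the prompt and the accumulated reflections.
        image = model.generate(prompt, context.images, context.feedback)

        score = verifier.score(image, prompt)
        if score > best_score:
            best_image, best_score = image, score
        if score >= threshold:
            break  # the generation satisfies the prompt; stop early

        # Otherwise, record this attempt and a textual critique so the next
        # generation explicitly targets the aspects that need improvement.
        context.images.append(image)
        context.feedback.append(critic.describe_issues(image, prompt))

    return best_image
```

Unlike best-of-N, which draws independent samples and keeps the highest-scoring one, each iteration here is conditioned on what went wrong in earlier attempts, which is why the abstract reports strong results with only 20 samples per prompt.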