Self-Evaluation Unlocks Any-Step Text-to-Image Generation

December 26, 2025
Authors: Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan
cs.AI

Abstract

We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
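To make concrete the abstract's remark that locally supervised flow models typically need many inference steps, here is a minimal sketch on a 1-D Gaussian toy problem. It is not the Self-E method, and nothing in it comes from the paper: the constants `MEAN` and `STD` and all function names are hypothetical, chosen so that the marginal Flow Matching velocity has a closed form and the Euler sampler can be compared against the exact ODE endpoint.

```python
import math

# Toy 1-D setup (all values are illustrative assumptions, not from the paper):
# noise x0 ~ N(0, 1), data x1 ~ N(MEAN, STD^2), path x_t = (1 - t) * x0 + t * x1.
MEAN, STD = 2.0, 0.5


def marginal_velocity(x: float, t: float) -> float:
    """Closed-form E[x1 - x0 | x_t = x] for the Gaussian toy problem."""
    var_t = (1.0 - t) ** 2 + (t * STD) ** 2   # Var(x_t)
    cov = t * STD ** 2 - (1.0 - t)            # Cov(x1 - x0, x_t)
    return MEAN + cov / var_t * (x - t * MEAN)


def euler_sample(x0: float, num_steps: int) -> float:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with uniform Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x += marginal_velocity(x, k * dt) * dt
    return x


def exact_sample(x0: float) -> float:
    """Exact endpoint of the same ODE for this toy problem."""
    return MEAN + STD * x0


def endpoint_error(x0: float, num_steps: int) -> float:
    """Absolute gap between the Euler endpoint and the exact endpoint."""
    return abs(euler_sample(x0, num_steps) - exact_sample(x0))


if __name__ == "__main__":
    for n in (1, 2, 4, 50):
        print(n, endpoint_error(1.0, n))
```

In this toy, a single Euler step maps every starting point to `MEAN`, so all sample diversity is lost, and the endpoint error shrinks as the step count grows. That is the few-step gap the abstract says Self-E is designed to close; how Self-E closes it is not modeled here.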