

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

April 20, 2026
作者: Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang
cs.AI

Abstract

Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders with conventional training strategies yields unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, because MeanFlow generation uses an extremely limited number of refinement steps (often just a single step), the text feature representations must possess sufficiently high discriminability. This also explains why discrete, easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, achieving efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on widely used diffusion models, demonstrating significant improvements in generation performance. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
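To make the two central ideas in the abstract concrete, here is a minimal, self-contained sketch: part 1 illustrates the one-step sampling pattern that MeanFlow-style models use (a network predicting the average velocity over an interval, applied once), and part 2 illustrates the discriminability analysis, comparing one-hot class embeddings with text embeddings that share a dominant direction via mean off-diagonal cosine similarity. The `stub_mean_velocity` function is a hypothetical placeholder, not the paper's trained network, and the synthetic embeddings are illustrative assumptions, not outputs of any real encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Part 1: one-step MeanFlow-style sampling (hypothetical stub network) ---
# MeanFlow trains a network u(z_t, r, t, c) to predict the *average* velocity
# over [r, t]; a sample is then produced in a single refinement step:
#   x = z_1 - (1 - 0) * u(z_1, r=0, t=1, c)

D = 16  # toy data dimensionality

def stub_mean_velocity(z, r, t, cond):
    # Placeholder "network": the average velocity pointing from the
    # condition-dependent data point toward the noise sample z.
    return z - cond

def one_step_sample(cond):
    z1 = rng.standard_normal(D)         # pure noise at t = 1
    u = stub_mean_velocity(z1, 0.0, 1.0, cond)
    return z1 - (1.0 - 0.0) * u         # single step from t = 1 to t = 0

cond = np.ones(D)
x = one_step_sample(cond)
print(np.allclose(x, cond))  # the stub maps any noise exactly onto cond -> True

# --- Part 2: embedding discriminability via mean off-diagonal cosine ---
def mean_offdiag_cosine(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = E @ E.T                          # pairwise cosine similarities
    n = len(E)
    return (S.sum() - n) / (n * (n - 1))  # average over off-diagonal pairs

class_emb = np.eye(8)                          # one-hot labels: fully distinct
text_emb = rng.standard_normal((8, 64)) + 3.0  # rows share a dominant direction

print(mean_offdiag_cosine(class_emb))  # 0.0: maximally discriminable
print(mean_offdiag_cosine(text_emb))   # high (~0.9): poorly discriminable
```

With only one refinement step, the condition signal cannot be disambiguated over many iterations, so a high off-diagonal similarity (as in the synthetic `text_emb` above) directly limits how well the model can separate prompts, which is the failure mode the paper attributes to naively plugged-in text encoders.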