

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

April 20, 2026
作者: Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang
cs.AI

Abstract

Few-step generation has been a long-standing goal, with recent one-step methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow focuses primarily on class-to-image generation. An intuitive yet unexplored direction, however, is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared with limited class labels, text conditions place greater demands on the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders with conventional training strategies yields unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, because MeanFlow generation uses extremely few refinement steps (e.g., only one), the text feature representations must be sufficiently discriminative. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we adopt an LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, achieving efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on widely used diffusion models, demonstrating significant improvements in generation performance. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
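
To make the one-step setting concrete, below is a minimal sketch of text-conditioned MeanFlow sampling under the usual average-velocity formulation (z_r = z_t − (t − r)·u(z_t, r, t)). The names `mean_flow_net` and `encode_text` are hypothetical placeholders for an average-velocity network and an LLM-based text encoder; they do not come from the paper or the released code, and the exact conditioning interface in EMF may differ.

```python
# Minimal sketch of one-step, text-conditioned MeanFlow sampling (assumptions, not the authors' code).
# `mean_flow_net(z, r, t, c)` is assumed to predict the average velocity u over [r, t] given text condition c;
# `encode_text(prompt)` is assumed to return a discriminative embedding from an LLM-based text encoder.
import torch

@torch.no_grad()
def sample_one_step(mean_flow_net, encode_text, prompt, shape, device="cuda"):
    c = encode_text(prompt)                       # text condition (LLM-based embedding)
    z1 = torch.randn(shape, device=device)        # pure noise at time t = 1
    r = torch.zeros(shape[0], device=device)      # target time r = 0 (data end of the path)
    t = torch.ones(shape[0], device=device)       # start time t = 1 (noise end of the path)
    u = mean_flow_net(z1, r, t, c)                # average velocity over the full interval [0, 1]
    # Single displacement step: z_0 = z_1 - (t - r) * u, i.e. one network evaluation per sample.
    x0 = z1 - (t - r).view(-1, 1, 1, 1) * u
    return x0
```

Because the whole trajectory is collapsed into this single displacement, the condition c has only one chance to steer the output, which is why the abstract argues that the text embedding must be highly discriminative, much like a discrete class label.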