Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Abstract

Few-step generation has been a long-standing goal, and recent one-step methods, exemplified by MeanFlow, have achieved remarkable results. Existing MeanFlow research focuses primarily on class-to-image generation. An intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared with a limited set of class labels, text conditions place greater demands on the model's understanding capability, which requires effectively integrating powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders with conventional training strategies yields unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and find that, because MeanFlow generation uses an extremely limited number of refinement steps (for example, only one), the text feature representations must be highly discriminative. This also explains why discrete, easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we adopt a powerful LLM-based text encoder that we validate to possess the required semantic properties and adapt the MeanFlow generation process to this framework, achieving efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on a widely used diffusion model and demonstrate significant improvements in generation performance. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
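The discriminability argument can be made concrete with a small numerical sketch. The abstract does not state which measure the authors use, so the function below is only an assumed proxy: the mean pairwise cosine similarity between condition embeddings, where lower values mean the conditions are easier to tell apart. The name `mean_pairwise_cosine` and the synthetic "text" features are hypothetical and are not taken from the paper or its code release.

```python
import numpy as np


def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows.

    Assumed proxy for discriminability (not the paper's metric):
    lower values mean the conditions are easier to separate.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two embeddings")
    # Normalize each row to unit length, guarding against zero vectors.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)
    sim = x @ x.T
    # Average over off-diagonal entries only (exclude self-similarity).
    off_diag_sum = sim.sum() - np.trace(sim)
    return float(off_diag_sum / (n * (n - 1)))


# One-hot class labels: every pair of conditions is orthogonal.
class_features = np.eye(4)

# Hypothetical text features that share a large common component,
# so different prompts end up close together in embedding space.
rng = np.random.default_rng(0)
text_features = 1.0 + 0.1 * rng.standard_normal((4, 16))

class_score = mean_pairwise_cosine(class_features)
text_score = mean_pairwise_cosine(text_features)
print(f"class labels: {class_score:.3f}, synthetic text: {text_score:.3f}")
```

Under this proxy, one-hot class features score exactly zero while the tightly clustered synthetic features score close to one, which matches the abstract's explanation of why class conditions are easy for a one-step generator to separate.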
