Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Abstract

Few-step generation has been a long-standing goal, and recent one-step methods, exemplified by MeanFlow, have achieved remarkable results. Existing MeanFlow research focuses primarily on class-to-image generation. An intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared with a limited set of class labels, text conditions place greater demands on the model's understanding capability, which requires integrating a powerful text encoder into the MeanFlow framework effectively. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders with conventional training strategies yields unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and show that, because MeanFlow generation uses an extremely limited number of refinement steps (for example, a single step), the text feature representations must be highly discriminative. This also explains why discrete, easily distinguishable class features work well within the MeanFlow framework. Guided by these insights, we use a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, achieving efficient text-conditioned synthesis for the first time. We further validate our approach on a widely used diffusion model and demonstrate significant improvements in generation performance. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
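To make the notion of discriminability concrete, the following is a minimal sketch that scores a set of condition embeddings by their mean pairwise cosine similarity. This proxy, the function names, and the toy vectors are all assumptions made for illustration; the abstract does not specify how discriminability is measured, and the vectors are not outputs of any real encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings.

    Lower values mean the embeddings are easier to tell apart.
    Illustrative proxy only; not the metric used in the paper.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)


# Hypothetical toy embeddings (assumed values, not real encoder outputs).
class_like = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot style
text_like = [[1.0, 0.9, 0.8], [0.9, 1.0, 0.8], [0.8, 0.9, 1.0]]   # heavily overlapping

print("class-like:", mean_pairwise_similarity(class_like))
print("text-like:", mean_pairwise_similarity(text_like))
```

Under this proxy, the one-hot class-like vectors are mutually orthogonal, while the overlapping text-like vectors are nearly parallel. That is the kind of gap the abstract points to when it contrasts easily distinguishable class features with text features that a one-step generator has little opportunity to disambiguate.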
