EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Abstract

Recent advances in generative models have demonstrated remarkable capabilities. However, most of these models are trained on proprietary high-quality data, and some withhold their parameters and expose only application programming interfaces (APIs), which limits their usefulness for downstream tasks. To explore whether a text-to-image generation model comparable to advanced models can be trained using publicly available resources, we introduce EvolveDirector. This framework interacts with advanced models through their public APIs to obtain text-image data pairs for training a base model. Our experiments with extensive data indicate that a model trained on data generated by an advanced model can approximate that model's generation capability. However, doing so requires 10 million or more samples, which incurs significant costs in time, computational resources, and especially fees for calling paid APIs. To address this problem, we leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model. The VLM continuously evaluates the base model during training and dynamically updates and refines the training dataset through discrimination, expansion, deletion, and mutation operations. Experimental results show that this paradigm significantly reduces the required data volume. Furthermore, when approaching multiple advanced models, EvolveDirector can select the best samples generated by them to learn powerful and balanced abilities. The final trained model, Edgen, is demonstrated to outperform these advanced models. The code and model weights are available at https://github.com/showlab/EvolveDirector.
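
The abstract names four dataset operations (discrimination, expansion, deletion, mutation) without defining them. The Python sketch below shows one plausible reading of a single dataset-update round, not the authors' implementation. Every name in it (`Sample`, `evolve_dataset`, the callback parameters, `mutation_rate`) is hypothetical, and the toy callbacks stand in for the base model, the advanced model's API, and the VLM. The assumed semantics are: discrimination compares the base model's image against the advanced model's image for the same prompt, deletion drops prompts the base model already handles, expansion adds prompt variations where it falls short, and mutation occasionally adds an unrelated prompt.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    prompt: str
    image: str  # placeholder for an image returned by the advanced model


def evolve_dataset(
    dataset: List[Sample],
    base_generate: Callable[[str], str],
    advanced_generate: Callable[[str], str],
    vlm_prefers_base: Callable[[str, str, str], bool],
    vlm_expand: Callable[[str], List[str]],
    vlm_mutate: Callable[[], str],
    mutation_rate: float,
    rng: random.Random,
) -> List[Sample]:
    """One hypothetical round of VLM-guided dataset refinement."""
    next_dataset: List[Sample] = []
    for sample in dataset:
        base_image = base_generate(sample.prompt)
        # Discrimination (assumed): the VLM judges whether the base model's
        # image is at least as good as the advanced model's image.
        if vlm_prefers_base(sample.prompt, base_image, sample.image):
            # Deletion (assumed): the prompt is treated as learned and dropped.
            continue
        next_dataset.append(sample)
        # Expansion (assumed): the VLM proposes variations of a prompt the
        # base model still handles poorly; the advanced model renders them.
        for new_prompt in vlm_expand(sample.prompt):
            next_dataset.append(Sample(new_prompt, advanced_generate(new_prompt)))
    # Mutation (assumed): with some probability, add an unrelated prompt
    # so the dataset keeps exploring new concepts.
    if rng.random() < mutation_rate:
        mutated = vlm_mutate()
        next_dataset.append(Sample(mutated, advanced_generate(mutated)))
    return next_dataset


def run_demo() -> List[str]:
    """Run one round with toy stand-ins for the models and the VLM."""
    learned = {"a red apple"}  # prompts the toy base model has mastered
    dataset = [
        Sample("a red apple", "adv:a red apple"),
        Sample("a blue bicycle", "adv:a blue bicycle"),
    ]
    updated = evolve_dataset(
        dataset,
        base_generate=lambda p: "base:" + p,
        advanced_generate=lambda p: "adv:" + p,
        vlm_prefers_base=lambda p, base_img, adv_img: p in learned,
        vlm_expand=lambda p: [p + ", at sunset"],
        vlm_mutate=lambda: "a glass teapot",
        mutation_rate=1.0,  # always mutate, to keep the demo deterministic
        rng=random.Random(0),
    )
    return [s.prompt for s in updated]
```

Calling `run_demo()` returns the prompts that survive one round: the mastered prompt is gone, the unmastered prompt is kept along with one variation, and one mutated prompt is appended. In a real pipeline the callbacks would wrap model inference and paid API calls, so the deletion step is what would reduce the number of API requests under this reading.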
