EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Abstract

Recent advances in generative models have shown remarkable capabilities in producing high-quality content. However, most of these models are trained on proprietary high-quality data, and some withhold their parameters and expose only application programming interfaces (APIs), which limits their usefulness for downstream tasks. To explore the feasibility of training a text-to-image generation model comparable to advanced models using publicly available resources, we introduce EvolveDirector. This framework interacts with advanced models through their public APIs to obtain text-image data pairs for training a base model. Our experiments with extensive data indicate that a model trained on data generated by an advanced model can approximate its generation capability. However, this requires large-scale samples of 10 million or more, which incurs significant costs in time, computational resources, and especially fees for calling paid APIs. To address this problem, we leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model. The VLM continuously evaluates the base model during training and dynamically updates and refines the training dataset through discrimination, expansion, deletion, and mutation operations. Experimental results show that this paradigm significantly reduces the required data volume. Furthermore, when approaching multiple advanced models, EvolveDirector can select the best samples generated by them to learn powerful and balanced abilities. The final trained model, Edgen, is demonstrated to outperform these advanced models. The code and model weights are available at https://github.com/showlab/EvolveDirector.
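The abstract names four dataset operations (discrimination, expansion, deletion, mutation) without spelling out how they interact. The sketch below is one plausible reading, not the authors' implementation. It assumes that discrimination is a VLM judgment on whether the base model already matches the advanced model for a prompt, that deletion drops prompts the base model has mastered, that expansion adds variations of prompts it still fails on, and that mutation injects new prompts. The function `evolve_step` and its callback parameters are hypothetical names introduced only for illustration. In a real system the callbacks would wrap VLM and API calls; here they are plain Python callables.

```python
from typing import Callable, List


def evolve_step(
    prompts: List[str],
    base_is_good_enough: Callable[[str], bool],
    expand: Callable[[str], List[str]],
    mutate: Callable[[], str],
    n_mutations: int = 1,
) -> List[str]:
    """Return the prompt set for the next training round (illustrative only).

    base_is_good_enough: stand-in for the VLM's discrimination step (assumed
        to compare the base model's image with the advanced model's image).
    expand: stand-in for generating variations of a hard prompt (assumption).
    mutate: stand-in for generating an unrelated new prompt (assumption).
    """
    next_prompts: List[str] = []
    for prompt in prompts:
        # Discrimination: ask the judge whether the base model has caught up.
        if base_is_good_enough(prompt):
            # Deletion: drop prompts the base model already handles.
            continue
        # Keep the hard prompt and add variations of it (expansion).
        next_prompts.append(prompt)
        next_prompts.extend(expand(prompt))
    # Mutation: add fresh prompts to keep the dataset diverse.
    next_prompts.extend(mutate() for _ in range(n_mutations))
    return next_prompts


# Toy usage with deterministic stand-ins for the VLM.
learned = {"a red cube"}
result = evolve_step(
    ["a red cube", "a cat reading a book"],
    base_is_good_enough=lambda p: p in learned,
    expand=lambda p: [p + " (variation)"],
    mutate=lambda: "a brand-new prompt",
)
print(result)
```

Under these assumptions the dataset concentrates on prompts the base model still gets wrong, which is consistent with the abstract's claim that VLM guidance reduces the required data volume.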
