
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

October 9, 2024
Authors: Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou
cs.AI

Abstract

Recent advances in generative models have showcased remarkable capabilities in generating high-quality content. However, most of these models are trained on proprietary high-quality data, and some withhold their parameters and expose only application programming interfaces (APIs), which limits their benefit for downstream tasks. To explore the feasibility of training a text-to-image generation model comparable to advanced models using publicly available resources, we introduce EvolveDirector. This framework interacts with advanced models through their public APIs to obtain text-image data pairs for training a base model. Our experiments with extensive data indicate that a model trained on data generated by an advanced model can approximate its generation capability. However, this requires at least 10 million samples, which incurs significant expenses in time, computational resources, and especially the cost of calling fee-based APIs. To address this problem, we leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model. The VLM continuously evaluates the base model during training and dynamically updates and refines the training dataset through discrimination, expansion, deletion, and mutation operations. Experimental results show that this paradigm significantly reduces the required data volume. Furthermore, when approaching multiple advanced models, EvolveDirector can select the best samples generated by them to learn powerful and balanced abilities. The final trained model, Edgen, is shown to outperform these advanced models. The code and model weights are available at https://github.com/showlab/EvolveDirector.
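
The abstract names four dataset operations (discrimination, expansion, deletion, mutation) without specifying them. The following is a minimal toy sketch of how such a curation step might be organized. It is not the authors' implementation. Every name in it (`Sample`, `advanced_model_api`, `vlm_base_is_as_good`, `vlm_expand`, `vlm_mutate`, `evolve_dataset`) is hypothetical, the model and VLM calls are replaced by string-based stubs, and the interpretation of each operation is an assumption: samples the base model already handles are dropped, samples it still fails on are kept and spawn prompt variations, and a few unrelated prompts are added for exploration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    prompt: str
    image: str  # stand-in for an image returned by an advanced model


def advanced_model_api(prompt: str) -> str:
    """Hypothetical stub for a call to an advanced model's public API."""
    return f"<image from advanced model for: {prompt}>"


def vlm_base_is_as_good(prompt: str, mastered: set) -> bool:
    """Discrimination (assumed): the VLM judges whether the base model's
    output matches the advanced model's. Stubbed as a set lookup."""
    return prompt in mastered


def vlm_expand(prompt: str, n: int) -> list:
    """Expansion (assumed): the VLM writes n variations of a hard prompt."""
    return [f"{prompt} (variation {i})" for i in range(1, n + 1)]


def vlm_mutate(step: int, n: int) -> list:
    """Mutation (assumed): the VLM proposes n unrelated prompts."""
    return [f"new concept {step}-{i}" for i in range(1, n + 1)]


def evolve_dataset(dataset, mastered, step, n_variations=2, n_mutations=1):
    """One curation step over the training set (toy version)."""
    kept = []
    new_prompts = []
    for sample in dataset:
        if vlm_base_is_as_good(sample.prompt, mastered):
            continue  # deletion: the base model no longer needs this sample
        kept.append(sample)
        new_prompts.extend(vlm_expand(sample.prompt, n_variations))
    new_prompts.extend(vlm_mutate(step, n_mutations))
    # Only the newly proposed prompts trigger (possibly paid) API calls.
    kept.extend(Sample(p, advanced_model_api(p)) for p in new_prompts)
    return kept


if __name__ == "__main__":
    prompts = ["a red cube", "a cat reading"]
    dataset = [Sample(p, advanced_model_api(p)) for p in prompts]
    updated = evolve_dataset(dataset, mastered={"a red cube"}, step=1)
    for s in updated:
        print(s.prompt)
```

In this toy setup the data-saving idea is visible in where the API is called: only for prompts the judge flags as still hard, plus a small exploratory set, rather than for a fixed pool of millions of prompts.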
