Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
January 26, 2024
Authors: Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song
cs.AI
Abstract
Recent advancements in text-to-image models have significantly enhanced image
generation capabilities, yet a notable gap persists among open-source models in
bilingual or Chinese language support. To address this need, we present
Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model
developed by extending the capabilities of CLIP and Stable-Diffusion-XL through
a process of bilingual continuous pre-training. This approach efficiently
expands the vocabulary by integrating the most frequently used Chinese
characters into CLIP's tokenizer and embedding layers, coupled with an
expansion of the absolute position encodings. Additionally, we enrich text
prompts with a large vision-language model, yielding better image captions and,
in turn, images of higher visual quality. These enhancements are subsequently
applied to downstream text-to-image models. Our empirical results indicate that
the developed CLIP model excels in bilingual image-text retrieval. Furthermore,
the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass
previous models. This research leads to the development and open-sourcing of
the Taiyi-Diffusion-XL model, representing a notable advancement in the field
of image generation, particularly for Chinese language applications. This
contribution is a step forward in addressing the need for more diverse language
support in multimodal research. The model and demonstration are made publicly
available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B,
fostering further research and collaboration in this domain.
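
To make the vocabulary and position-encoding expansion concrete, here is a
minimal sketch, assuming the Hugging Face transformers API and an OpenAI CLIP
checkpoint as the base encoder; the sample characters and the 248-token target
length are illustrative assumptions, not the paper's released configuration.

```python
# Sketch only: expand a CLIP text encoder's vocabulary with Chinese
# characters and grow its absolute position embeddings. The checkpoint,
# character subset, and target length below are assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Add frequently used Chinese characters to the tokenizer
#    (tiny hypothetical subset; the paper adds the most common characters).
tokenizer.add_tokens(["太", "乙", "图", "像", "生", "成"])

# 2) Grow the token-embedding matrix to match; new rows are randomly
#    initialized and learned during bilingual continued pre-training.
text_encoder.resize_token_embeddings(len(tokenizer))

# 3) Expand the absolute position embeddings to a longer context,
#    copying the learned weights for the original 77 positions.
embeddings = text_encoder.text_model.embeddings
old_pos = embeddings.position_embedding
new_len = 248  # assumed target length
new_pos = torch.nn.Embedding(new_len, old_pos.embedding_dim)
with torch.no_grad():
    new_pos.weight[: old_pos.num_embeddings] = old_pos.weight
embeddings.position_embedding = new_pos
embeddings.register_buffer(
    "position_ids", torch.arange(new_len).unsqueeze(0), persistent=False
)
text_encoder.config.max_position_embeddings = new_len
tokenizer.model_max_length = new_len
```

After this surgery the new embedding rows carry no meaning yet; the bilingual
continuous pre-training described in the abstract is what trains them on
Chinese-English image-text pairs.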
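
The prompt-enrichment step can likewise be sketched with an off-the-shelf
vision-language captioner. BLIP-2 is used below purely as a stand-in, since the
abstract does not name the specific large vision-language model, and the
helper function is hypothetical.

```python
# Sketch only: regenerate detailed captions for training images with a
# vision-language model. BLIP-2 is an assumed stand-in for the paper's
# unnamed captioner.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
vlm = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def enrich_caption(image_path: str) -> str:
    """Generate a detailed caption to replace sparse web alt-text."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = vlm.generate(**inputs, max_new_tokens=64)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()

# Hypothetical usage on one training image:
# print(enrich_caption("example.jpg"))
```

The enriched captions then replace or augment the original alt-text when
training the downstream text-to-image model.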