Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
January 26, 2024
Authors: Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song
cs.AI
Abstract
Recent advancements in text-to-image models have significantly enhanced image
generation capabilities, yet a notable gap persists among open-source models in
bilingual or Chinese language support. To address this need, we present
Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model
developed by extending the capabilities of CLIP and Stable-Diffusion-XL through
a process of bilingual continuous pre-training. This approach efficiently
expands the vocabulary by integrating the most frequently used Chinese
characters into CLIP's tokenizer and embedding layers, coupled with an
expansion of the absolute position encodings. Additionally, we enrich text
prompts with a large vision-language model, yielding better image captions and,
in turn, images of higher visual quality. These enhancements are subsequently
applied to downstream text-to-image models. Our empirical results indicate that
the developed CLIP model excels in bilingual image-text retrieval. Furthermore,
the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass
previous models. This research leads to the development and open-sourcing of
the Taiyi-Diffusion-XL model, representing a notable advancement in the field
of image generation, particularly for Chinese language applications. This
contribution is a step forward in addressing the need for more diverse language
support in multimodal research. The model and demonstration are made publicly
available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B,
fostering further research and collaboration in this domain.
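
To make the vocabulary and position-encoding expansion concrete, here is a
minimal sketch, assuming the Hugging Face transformers API and an OpenAI CLIP
checkpoint as the base encoder; the sample characters and the 248-token target
length are illustrative assumptions, not the paper's released configuration.

```python
# Sketch only: expand a CLIP text encoder's vocabulary with Chinese
# characters and grow its absolute position embeddings. The checkpoint,
# character subset, and target length below are assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Add frequently used Chinese characters to the tokenizer
#    (tiny hypothetical subset; the paper adds the most common characters).
tokenizer.add_tokens(["太", "乙", "图", "像", "生", "成"])

# 2) Grow the token-embedding matrix to match; new rows are randomly
#    initialized and learned during bilingual continued pre-training.
text_encoder.resize_token_embeddings(len(tokenizer))

# 3) Expand the absolute position embeddings to a longer context,
#    copying the learned weights for the original 77 positions.
embeddings = text_encoder.text_model.embeddings
old_pos = embeddings.position_embedding
new_len = 248  # assumed target length
new_pos = torch.nn.Embedding(new_len, old_pos.embedding_dim)
with torch.no_grad():
    new_pos.weight[: old_pos.num_embeddings] = old_pos.weight
embeddings.position_embedding = new_pos
embeddings.register_buffer(
    "position_ids", torch.arange(new_len).unsqueeze(0), persistent=False
)
text_encoder.config.max_position_embeddings = new_len
tokenizer.model_max_length = new_len
```

After this surgery the new embedding rows carry no meaning yet; the bilingual
continuous pre-training described in the abstract is what trains them on
Chinese-English image-text pairs.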
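
The prompt-enrichment step can likewise be sketched with an off-the-shelf
vision-language captioner. BLIP-2 is used below purely as a stand-in, since the
abstract does not name the specific large vision-language model, and the
helper function is hypothetical.

```python
# Sketch only: regenerate detailed captions for training images with a
# vision-language model. BLIP-2 is an assumed stand-in for the paper's
# unnamed captioner.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
vlm = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def enrich_caption(image_path: str) -> str:
    """Generate a detailed caption to replace sparse web alt-text."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = vlm.generate(**inputs, max_new_tokens=64)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()

# Hypothetical usage on one training image:
# print(enrich_caption("example.jpg"))
```

The enriched captions then replace or augment the original alt-text when
training the downstream text-to-image model.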