Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
January 26, 2024
Authors: Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song
cs.AI
Abstract
Recent advancements in text-to-image models have significantly enhanced image
generation capabilities, yet a notable gap of open-source models persists in
bilingual or Chinese language support. To address this need, we present
Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model
which is developed by extending the capabilities of CLIP and
Stable-Diffusion-XL through a process of bilingual continuous pre-training.
This approach includes the efficient expansion of vocabulary by integrating the
most frequently used Chinese characters into CLIP's tokenizer and embedding
layers, coupled with an absolute position encoding expansion. Additionally, we
enrich text prompts using a large vision-language model, yielding better image
captions with higher visual quality. These enhancements are subsequently
applied to downstream text-to-image models. Our empirical results indicate that
the developed CLIP model excels in bilingual image-text retrieval. Furthermore,
the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass
previous models. This research leads to the development and open-sourcing of
the Taiyi-Diffusion-XL model, representing a notable advancement in the field
of image generation, particularly for Chinese language applications. This
contribution is a step forward in addressing the need for more diverse language
support in multimodal research. The model and demonstration are made publicly
available at
https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/,
fostering further research and collaboration in this domain.
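The vocabulary and position-encoding expansion described in the abstract can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the toy sizes, the mean-of-existing-rows initialization for new token embeddings, and the copy-last-row fill for extended positions are all assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


def expand_token_embeddings(old_emb: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    """Grow a token-embedding table to make room for newly added tokens
    (e.g. frequent Chinese characters appended to CLIP's tokenizer).
    New rows are initialized to the mean of the existing embeddings,
    a common heuristic (assumption; the paper's exact init may differ)."""
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new_tokens, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight      # keep pretrained rows
        new_emb.weight[old_vocab:] = old_emb.weight.mean(dim=0)
    return new_emb


def expand_position_embeddings(old_pos: nn.Embedding, new_len: int) -> nn.Embedding:
    """Extend absolute position embeddings to a longer context by copying
    the learned positions and filling the tail with the last learned row
    (a simple sketch; interpolation is another common choice)."""
    old_len, dim = old_pos.weight.shape
    new_pos = nn.Embedding(new_len, dim)
    with torch.no_grad():
        new_pos.weight[:old_len] = old_pos.weight
        new_pos.weight[old_len:] = old_pos.weight[-1]
    return new_pos


# Toy sizes standing in for CLIP's vocabulary and 77-token context window.
tok = expand_token_embeddings(nn.Embedding(1000, 64), num_new_tokens=200)
pos = expand_position_embeddings(nn.Embedding(77, 64), new_len=154)
print(tok.weight.shape, pos.weight.shape)  # torch.Size([1200, 64]) torch.Size([154, 64])
```

After expanding both tables this way, the model would be continuously pre-trained on bilingual data so the newly added rows acquire meaningful representations, as the abstract describes.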