Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

Abstract

Recent advances in text-to-image models have significantly improved image generation, yet open-source models still show a notable gap in bilingual or Chinese language support. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model developed by extending the capabilities of CLIP and Stable-Diffusion-XL through bilingual continuous pre-training. The approach efficiently expands the vocabulary by integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an expansion of the absolute position encoding. Additionally, we enrich text prompts using a large vision-language model, yielding better image captions and higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval. Furthermore, the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass those of previous models. This research leads to the development and open-sourcing of the Taiyi-Diffusion-XL model, representing a notable advancement in image generation, particularly for Chinese language applications. This contribution is a step forward in addressing the need for more diverse language support in multimodal research. The model and demonstration are publicly available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/, fostering further research and collaboration in this domain.
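
The vocabulary and position-encoding expansion mentioned in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names `expand_vocab` and `expand_positions`, the random initialization of new token rows, and the choice to copy the last trained row into new positions are all assumptions made for illustration, since the abstract does not specify these details. The only point shown is that existing rows are left untouched while new rows are appended.

```python
import random


def expand_vocab(vocab, embeddings, new_tokens, seed=0):
    """Append unseen tokens to the vocabulary, one new embedding row each.

    Existing rows are copied unchanged so the original tokens keep their
    pretrained vectors. New rows use small Gaussian noise (an assumption;
    the abstract does not say how new embeddings are initialized).
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])  # assumes a non-empty embedding table
    vocab = dict(vocab)
    embeddings = [row[:] for row in embeddings]
    for token in new_tokens:
        if token in vocab:
            continue  # already covered by the original vocabulary
        vocab[token] = len(embeddings)
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


def expand_positions(pos_embeddings, new_len):
    """Grow an absolute position table to new_len rows.

    Trained positions keep their rows. New positions start as copies of
    the last trained row (a hypothetical choice for this sketch).
    """
    if new_len < len(pos_embeddings):
        raise ValueError("new_len must not shrink the position table")
    extra = [pos_embeddings[-1][:] for _ in range(new_len - len(pos_embeddings))]
    return [row[:] for row in pos_embeddings] + extra


# Toy data: a two-token vocabulary with 2-dimensional embeddings.
vocab = {"a": 0, "cat": 1}
emb = [[0.1, 0.2], [0.3, 0.4]]
new_vocab, new_emb = expand_vocab(vocab, emb, ["猫", "cat", "狗"])

pos = [[0.0, 0.0], [1.0, 1.0]]
new_pos = expand_positions(pos, 4)
```

In a real setting the same idea would be applied to the text encoder's token and position embedding matrices before continued pre-training, so that the newly added rows are learned while the original ones start from their pretrained values.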
