Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Abstract

Recent advances in autoregressive and diffusion models have led to strong performance in generating images that contain short scene text. However, generating coherent long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work focused specifically on long-text image generation, addressing a critical gap in existing text-to-image systems, which typically handle only brief phrases or single sentences. Through a comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck for text generation quality. To address this, we introduce a novel text-focused binary tokenizer optimized for capturing detailed scene text features. Building on this tokenizer, we develop \ModelName, a multimodal autoregressive model that generates high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName~significantly outperforms SD3.5 Large~\cite{sd3} and GPT4o~\cite{gpt4o} with DALL-E 3~\cite{dalle3} in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName~opens up opportunities for innovative applications such as interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.
