Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Abstract

Recent advances in autoregressive and diffusion models have led to strong performance in generating images that contain short scene text. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long-text image generation, addressing a critical gap in existing text-to-image systems, which typically handle only brief phrases or single sentences. Through a comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck for text generation quality. To address this, we introduce a novel text-focused binary tokenizer optimized for capturing detailed scene text features. Building on this tokenizer, we develop \ModelName, a multimodal autoregressive model that generates high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName~significantly outperforms SD3.5 Large~\cite{sd3} and GPT4o~\cite{gpt4o} with DALL-E 3~\cite{dalle3} in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName~opens up opportunities for applications such as interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.
