Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

March 26, 2025
Authors: Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li
cs.AI

Abstract

Recent advances in autoregressive and diffusion models have led to strong performance in generating images that contain short scene text. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long-text image generation, addressing a critical gap in existing text-to-image systems, which typically handle only brief phrases or single sentences. Through a comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused binary tokenizer optimized for capturing detailed scene text features. Leveraging this tokenizer, we develop a multimodal autoregressive model that generates high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that our model significantly outperforms SD3.5 Large and GPT4o with DALL-E 3 in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, our model opens up opportunities for applications such as interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.
