Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Abstract

Recent advances in autoregressive and diffusion models have led to strong performance in generating images that contain short scene-text words. However, generating coherent long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long-text image generation, addressing a critical gap in existing text-to-image systems, which typically handle only brief phrases or single sentences. Through a comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused binary tokenizer optimized for capturing detailed scene-text features. Leveraging our tokenizer, we develop \ModelName, a multimodal autoregressive model that excels at generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName significantly outperforms SD3.5 Large and GPT4o with DALL-E 3 in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName opens up exciting opportunities for innovative applications such as interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.
