

UM-Text: A Unified Multimodal Model for Image Understanding

January 13, 2026
Authors: Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang
cs.AI

Abstract

With the rapid advancement of image generation, visual text editing via natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and the reference image, and thus generate visual text whose style is consistent with the image. Previous methods often require tedious steps to specify the text content and its attributes (such as font size, color, and layout), and do not fully account for stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing through natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and the reference image, so that the text content and layout can be carefully designed according to the contextual information. To generate accurate and harmonious visual text images, we further propose the UM-Encoder, which combines the embeddings of the various condition signals, with the combination automatically configured by the VLM according to the input instruction. During training, we propose a regional consistency loss that offers more effective supervision for glyph generation in both the latent and RGB spaces, and we design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute UM-DATA-200K, a large-scale visual-text image dataset covering diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
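
The abstract does not spell out how the UM-Encoder combines the condition embeddings. Below is a minimal PyTorch sketch of one plausible reading, in which per-condition embeddings are projected into a shared space and gated by weights the VLM is assumed to emit from the parsed instruction. The condition names (glyph, layout, style) and the gating interface are hypothetical; this is an illustration under stated assumptions, not the paper's implementation.

```python
# Hypothetical sketch of VLM-configured condition-embedding fusion.
# The condition keys and the gate values are assumptions, not the
# actual UM-Encoder interface.
import torch
import torch.nn as nn


class ConditionFusion(nn.Module):
    """Fuses per-condition embeddings into one token sequence, gated by
    weights that a VLM is assumed to derive from the input instruction."""

    def __init__(self, dim: int, conditions=("glyph", "layout", "style")):
        super().__init__()
        self.conditions = conditions
        # One projection per condition so each modality maps into a shared space.
        self.proj = nn.ModuleDict({c: nn.Linear(dim, dim) for c in conditions})

    def forward(self, embeds: dict, gates: dict) -> torch.Tensor:
        # embeds[c]: (batch, tokens_c, dim); gates[c]: scalar in [0, 1],
        # e.g. 0 to drop a condition the instruction does not mention.
        fused = [gates[c] * self.proj[c](embeds[c])
                 for c in self.conditions
                 if c in embeds and gates.get(c, 0.0) > 0]
        # Concatenate along the token axis so a diffusion backbone can
        # attend over all active conditions at once.
        return torch.cat(fused, dim=1)


if __name__ == "__main__":
    dim = 64
    fusion = ConditionFusion(dim)
    embeds = {c: torch.randn(2, 8, dim) for c in ("glyph", "layout", "style")}
    gates = {"glyph": 1.0, "layout": 1.0, "style": 0.0}  # instruction ignores style
    print(fusion(embeds, gates).shape)  # torch.Size([2, 16, 64]); style is gated out
```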
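
The regional consistency loss is described as supervising glyph generation in both the latent and RGB spaces. A minimal sketch of a masked two-space reconstruction loss in that spirit is shown below, assuming a binary mask over the edited text region and decoded RGB outputs; the loss weights, the exact error terms, and the tensor shapes are placeholders rather than the paper's formulation.

```python
# Hypothetical sketch of a masked latent + RGB reconstruction loss in the
# spirit of the paper's regional consistency loss. Weights and error terms
# are assumptions.
import torch
import torch.nn.functional as F


def regional_consistency_loss(pred_latent, target_latent, pred_rgb,
                              target_rgb, mask_latent, mask_rgb,
                              w_latent=1.0, w_rgb=1.0):
    """Masked reconstruction loss applied in both latent and RGB space.

    pred_latent, target_latent: (B, C, h, w) latents.
    pred_rgb, target_rgb:       (B, 3, H, W) decoded images.
    mask_latent, mask_rgb:      (B, 1, ...) binary masks marking the edited
                                text region, resized to match each space.
    """
    eps = 1e-6
    # Latent-space term: MSE restricted to the text region.
    lat_err = F.mse_loss(pred_latent, target_latent, reduction="none") * mask_latent
    lat = lat_err.sum() / (mask_latent.expand_as(lat_err).sum() + eps)
    # RGB-space term: L1 on the decoded pixels of the same region, which
    # supervises glyph shapes more directly than the latent term alone.
    rgb_err = F.l1_loss(pred_rgb, target_rgb, reduction="none") * mask_rgb
    rgb = rgb_err.sum() / (mask_rgb.expand_as(rgb_err).sum() + eps)
    return w_latent * lat + w_rgb * rgb


if __name__ == "__main__":
    B, C, h, w, H, W = 2, 4, 32, 32, 256, 256
    loss = regional_consistency_loss(
        torch.randn(B, C, h, w), torch.randn(B, C, h, w),
        torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
        torch.ones(B, 1, h, w), torch.ones(B, 1, h, w))
    print(loss.item())
```

Normalizing each term by the number of masked elements keeps the loss scale independent of how large the edited region is, which is one common design choice for region-restricted losses.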