

LEOPARD: A Vision-Language Model for Text-Rich Multi-Image Tasks

October 2, 2024
Authors: Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, Dong Yu
cs.AI

Abstract

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution with visual feature sequence length. To address these challenges, we propose LEOPARD, an MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities in text-rich, multi-image evaluations and competitive performance in general-domain evaluations.
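The abstract's second contribution, adaptive allocation of visual sequence length across multiple high-resolution images, can be illustrated with a minimal sketch. This is not the paper's implementation: the tile budget, the proportional-to-area split, and the aspect-ratio-matched grid search below are all illustrative assumptions, standing in for whatever allocation policy LEOPARD actually uses.

```python
# Hypothetical sketch of adaptive visual-sequence-length allocation.
# Assumptions (not from the paper): a fixed total budget of visual
# "tiles" (each encoded to a fixed number of tokens), per-image tile
# counts proportional to pixel area, and a tile grid per image chosen
# to best match that image's aspect ratio.
from math import inf

def allocate_tiles(image_sizes, total_tiles):
    """Split a total tile budget across images in proportion to pixel area."""
    areas = [w * h for w, h in image_sizes]
    total_area = sum(areas)
    # Guarantee at least one tile per image; distribute the rest by area.
    alloc = [1] * len(image_sizes)
    remaining = total_tiles - len(image_sizes)
    for i, area in enumerate(areas):
        alloc[i] += int(remaining * area / total_area)
    return alloc

def best_grid(aspect_ratio, n_tiles):
    """Pick a (rows, cols) grid with rows*cols <= n_tiles whose
    width/height ratio is closest to the image's aspect ratio."""
    best, best_err = (1, 1), inf
    for rows in range(1, n_tiles + 1):
        for cols in range(1, n_tiles // rows + 1):
            err = abs(cols / rows - aspect_ratio)
            if err < best_err:
                best, best_err = (rows, cols), err
    return best

# Example: a wide slide, a tall receipt, a square figure.
sizes = [(1920, 1080), (600, 1600), (1000, 1000)]
tiles = allocate_tiles(sizes, total_tiles=12)
grids = [best_grid(w / h, t) for (w, h), t in zip(sizes, tiles)]
```

Under this sketch, a wide image ends up with a grid of more columns than rows and a tall image with more rows than columns, so each image's share of the fixed visual-token budget tracks both its resolution and its shape, which is the trade-off the abstract describes.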


November 16, 2024