MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
May 26, 2025
作者: Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
cs.AI
Abstract
Manga, or Japanese comics, is a richly multimodal narrative form that blends
images and text in complex ways. Teaching large multimodal models (LMMs) to
understand such narratives at a human-like level could help manga creators
reflect on and refine their stories. To this end, we introduce two benchmarks
for multimodal manga understanding: MangaOCR, which targets in-page text
recognition, and MangaVQA, a novel benchmark designed to evaluate contextual
understanding through visual question answering. MangaVQA consists of 526
high-quality, manually constructed question-answer pairs, enabling reliable
evaluation across diverse narrative and visual scenarios. Building on these
benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the
open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive
experiments, including comparisons with proprietary models such as GPT-4o and
Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model
provide a comprehensive foundation for evaluating and advancing LMMs in the
richly narrative domain of manga.
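For readers who want to try a comparable setup, the snippet below is a minimal sketch of posing a MangaVQA-style question to the base Qwen2.5-VL model through the Hugging Face transformers API. The checkpoint ID, the image path (manga_page.png), and the question are illustrative placeholders, and this is not the authors' released MangaLMM training or evaluation code.

```python
# Illustrative VQA-style inference with the base Qwen2.5-VL model via
# Hugging Face transformers. Checkpoint ID, image path, and question are
# placeholders; this is a sketch, not the MangaLMM pipeline from the paper.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # base model that MangaLMM is finetuned from
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One MangaVQA-style query: a manga page image plus a free-form question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "manga_page.png"},  # placeholder path
        {"type": "text", "text": "What does the character in the last panel say?"},
    ],
}]

# Build the chat prompt and pack the image in the format the processor expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens to keep only the model's answer.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same interface would apply to a manga-finetuned checkpoint by swapping in its model ID, which is how a MangaLMM-style model could be queried on both the MangaOCR and MangaVQA tasks.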