MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
May 26, 2025
Authors: Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
cs.AI
Abstract
Manga, or Japanese comics, is a richly multimodal narrative form that blends
images and text in complex ways. Teaching large multimodal models (LMMs) to
understand such narratives at a human-like level could help manga creators
reflect on and refine their stories. To this end, we introduce two benchmarks
for multimodal manga understanding: MangaOCR, which targets in-page text
recognition, and MangaVQA, a novel benchmark designed to evaluate contextual
understanding through visual question answering. MangaVQA consists of 526
high-quality, manually constructed question-answer pairs, enabling reliable
evaluation across diverse narrative and visual scenarios. Building on these
benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the
open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive
experiments, including comparisons with proprietary models such as GPT-4o and
Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model
provide a comprehensive foundation for evaluating and advancing LMMs in the
richly narrative domain of manga.
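Since MangaLMM is finetuned from the open-source Qwen2.5-VL, a reader may find it helpful to see how such a model is queried for manga VQA. The following is a minimal sketch using the Hugging Face transformers interface for the base Qwen2.5-VL checkpoint; the checkpoint name, image path, and question are illustrative placeholders and not the paper's released code or prompts.

```python
# Minimal sketch: ask a VQA-style question about one manga page with a
# Qwen2.5-VL checkpoint via Hugging Face transformers (>= 4.49).
# Checkpoint, image path, and question are placeholders, not the paper's artifacts.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # base model the paper finetunes from
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One page image plus a free-form question, formatted as a chat message.
image = Image.open("path/to/manga_page.png")  # placeholder path
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does the character in the last panel say?"},
        ],
    }
]

# Build the prompt text, bind the image, and generate a short answer.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The same chat-style interface can carry an OCR-style instruction (e.g., "transcribe all text on this page") in place of the question, which is consistent with the paper's goal of handling MangaOCR and MangaVQA with a single model.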