MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
May 26, 2025
Authors: Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
cs.AI
Abstract
Manga, or Japanese comics, is a richly multimodal narrative form that blends
images and text in complex ways. Teaching large multimodal models (LMMs) to
understand such narratives at a human-like level could help manga creators
reflect on and refine their stories. To this end, we introduce two benchmarks
for multimodal manga understanding: MangaOCR, which targets in-page text
recognition, and MangaVQA, a novel benchmark designed to evaluate contextual
understanding through visual question answering. MangaVQA consists of 526
high-quality, manually constructed question-answer pairs, enabling reliable
evaluation across diverse narrative and visual scenarios. Building on these
benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the
open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive
experiments, including comparisons with proprietary models such as GPT-4o and
Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model
provide a comprehensive foundation for evaluating and advancing LMMs in the
richly narrative domain of manga.
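Since MangaLMM is finetuned from the open-source Qwen2.5-VL, a reader may find it helpful to see how such a model is queried for manga VQA. The following is a minimal sketch using the Hugging Face transformers interface for the base Qwen2.5-VL checkpoint; the checkpoint name, image path, and question are illustrative placeholders and not the paper's released code or prompts.

```python
# Minimal sketch: ask a VQA-style question about one manga page with a
# Qwen2.5-VL checkpoint via Hugging Face transformers (>= 4.49).
# Checkpoint, image path, and question are placeholders, not the paper's artifacts.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # base model the paper finetunes from
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One page image plus a free-form question, formatted as a chat message.
image = Image.open("path/to/manga_page.png")  # placeholder path
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does the character in the last panel say?"},
        ],
    }
]

# Build the prompt text, bind the image, and generate a short answer.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The same chat-style interface can carry an OCR-style instruction (e.g., "transcribe all text on this page") in place of the question, which is consistent with the paper's goal of handling MangaOCR and MangaVQA with a single model.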