MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Abstract

Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model fine-tuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
