One missing piece in Vision and Language: A Survey on Comics Understanding
September 14, 2024
Authors: Emanuele Vivoli, Andrey Barsky, Mohamed Ali Souibgui, Artemis LLabres, Marco Bertini, Dimosthenis Karatzas
cs.AI
Abstract
Vision-language models have recently evolved into versatile systems capable
of high performance across a range of tasks, such as document understanding,
visual question answering, and grounding, often in zero-shot settings. Comics
Understanding, a complex and multifaceted field, stands to greatly benefit from
these advances. Comics, as a medium, combine rich visual and textual
narratives, challenging AI models with tasks that span image classification,
object detection, instance segmentation, and deeper narrative comprehension
through sequential panels. However, the unique structure of comics --
characterized by creative variations in style, reading order, and non-linear
storytelling -- presents a set of challenges distinct from those in other
visual-language domains. In this survey, we present a comprehensive review of
Comics Understanding from both dataset and task perspectives. Our contributions
are fivefold: (1) We analyze the structure of the comics medium, detailing its
distinctive compositional elements; (2) We survey the widely used datasets and
tasks in comics research, emphasizing their role in advancing the field; (3) We
introduce the Layer of Comics Understanding (LoCU) framework, a novel taxonomy
that redefines vision-language tasks within comics and lays the foundation for
future work; (4) We provide a detailed review and categorization of existing
methods following the LoCU framework; (5) Finally, we highlight current
research challenges and propose directions for future exploration, particularly
in the context of vision-language models applied to comics. This survey is the
first to propose a task-oriented framework for comics intelligence and aims to
guide future research by addressing critical gaps in data availability and task
definition. A project associated with this survey is available at
https://github.com/emanuelevivoli/awesome-comics-understanding.