Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

February 2, 2026
Authors: Andong Chen, Wenxin Zhu, Qiuyu Ding, Yuchen Song, Muyun Yang, Tiejun Zhao
cs.AI

Abstract

Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.