

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

March 10, 2026
Authors: Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai
cs.AI

Abstract

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images, from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure-text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
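The abstract describes the rendering setup and the self-distillation recipe only at a high level. As a rough illustration of one plausible reading, the Python sketch below renders a question into an image (with font and resolution as explicit, tunable rendering choices, the kind of confound the abstract highlights) and pairs the model's own correct text-mode reasoning trace with that image as an image-mode fine-tuning target. Every name here (render_question_image, generate_text_answer, the DejaVuSans font, the wrapping heuristic) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of the self-distillation data construction suggested by the
# abstract; function names, the font file, and the wrapping heuristic are assumptions.
from PIL import Image, ImageDraw, ImageFont


def render_question_image(question: str,
                          font_path: str = "DejaVuSans.ttf",  # assumed font; any TTF works
                          font_size: int = 24,
                          width: int = 1024) -> Image.Image:
    """Render a question as a plain text image; font and resolution are the
    kinds of rendering choices the paper identifies as strong confounds."""
    font = ImageFont.truetype(font_path, font_size)
    # Crude character-count wrapping; a real pipeline would measure rendered width.
    chars_per_line = max(1, width // (font_size // 2))
    lines = [question[i:i + chars_per_line]
             for i in range(0, len(question), chars_per_line)]
    height = int(font_size * 1.2) * (len(lines) + 2)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((10, 10 + row * int(font_size * 1.2)), line, fill="black", font=font)
    return img


def build_self_distillation_set(dataset, generate_text_answer, is_correct):
    """Pair the model's own correct text-mode reasoning traces with image inputs."""
    examples = []
    for item in dataset:
        trace = generate_text_answer(item["question"])  # text-mode chain of thought
        if not is_correct(trace, item["answer"]):
            continue  # keep only traces that reach the reference answer
        examples.append({
            "image": render_question_image(item["question"]),
            "prompt": "Answer the question shown in the image.",
            "target": trace,  # image-mode target is the model's own text reasoning
        })
    return examples
```

Filtering to traces that already reach the reference answer is what distinguishes this from ordinary supervised fine-tuning: the image-mode inputs are supervised by the same model's own text-mode reasoning, which is the sense in which the abstract calls the method self-distillation.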