
Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

March 10, 2026
Authors: Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai
cs.AI

Abstract

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
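The self-distillation recipe described in the abstract (pairing the model's own text-mode reasoning traces with image renderings of the same questions) can be sketched roughly as follows. This is a minimal illustration under assumed APIs: the renderer, the `model.generate` call, and the `DistillExample` container are hypothetical names, not the authors' released code.

```python
# Hypothetical sketch of the self-distillation data construction: the model answers
# each question from *text* input, and its own reasoning trace is then paired with
# an *image* rendering of the same question for fine-tuning.
# Names and APIs here are assumptions, not the paper's code.
from dataclasses import dataclass
from PIL import Image, ImageDraw, ImageFont


@dataclass
class DistillExample:
    question_image: Image.Image  # question rendered as pixels
    target_trace: str            # model's own text-mode chain of thought + final answer


def render_text_as_image(text: str, width: int = 800) -> Image.Image:
    """Render a question string onto a plain white canvas (deliberately simple renderer)."""
    font = ImageFont.load_default()
    # Rough line wrapping so long questions stay on the canvas.
    words, lines, line = text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 80:
            lines.append(line)
            line = w
        else:
            line = f"{line} {w}".strip()
    lines.append(line)
    img = Image.new("RGB", (width, 30 * len(lines) + 40), "white")
    draw = ImageDraw.Draw(img)
    for i, ln in enumerate(lines):
        draw.text((20, 20 + 30 * i), ln, fill="black", font=font)
    return img


def build_self_distillation_set(model, questions: list[str]) -> list[DistillExample]:
    """Collect the model's text-mode reasoning traces and pair them with image inputs."""
    dataset = []
    for q in questions:
        trace = model.generate(text=q)  # assumed text-only generation API
        dataset.append(DistillExample(render_text_as_image(q), trace))
    return dataset
```

Fine-tuning the MLLM on such (image, trace) pairs keeps the supervision signal in the model's own distribution, which is what the abstract credits for the GSM8K image-mode gains without catastrophic forgetting; rendering choices in a real pipeline would need the care the paper highlights (font and resolution alone shift accuracy substantially).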