

Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

March 19, 2026
Authors: Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Junnan Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, Hai-Tao Zheng, Ying Shen, Liang Lin, Philip S. Yu
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.
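The abstract's central finding, that models can answer reasoning questions about a symbol they cannot reliably transcribe, suggests a simple paired-probe style of evaluation. The sketch below illustrates that idea only; it is not the paper's benchmark code. `SymbolProbe`, `mismatch_report`, and the `ask_model` callable are hypothetical names standing in for whatever MLLM interface and ground-truth format one actually uses.

```python
# Minimal sketch (not the paper's benchmark): for the same symbol image, compare a
# model's accuracy on a basic recognition prompt versus a reasoning prompt whose
# answer depends on that symbol. A large recognition-reasoning gap is the
# "cognitive mismatch" pattern the abstract describes.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SymbolProbe:
    image_path: str          # image of a discrete symbol (formula, glyph, structure)
    recognition_prompt: str  # e.g. "Transcribe the formula shown in the image."
    recognition_answer: str
    reasoning_prompt: str    # e.g. "Solve for x in the equation shown."
    reasoning_answer: str


def mismatch_report(
    probes: Iterable[SymbolProbe],
    ask_model: Callable[[str, str], str],  # hypothetical: (image_path, prompt) -> answer
) -> dict:
    """Return recognition accuracy, reasoning accuracy, and their gap."""
    rec_hits = rsn_hits = total = 0
    for p in probes:
        total += 1
        rec_hits += ask_model(p.image_path, p.recognition_prompt).strip() == p.recognition_answer
        rsn_hits += ask_model(p.image_path, p.reasoning_prompt).strip() == p.reasoning_answer
    if total == 0:
        raise ValueError("no probes provided")
    rec_acc = rec_hits / total
    rsn_acc = rsn_hits / total
    return {
        "recognition_acc": rec_acc,
        "reasoning_acc": rsn_acc,
        # Positive gap: the model "reasons" correctly about symbols it fails to perceive.
        "mismatch_gap": rsn_acc - rec_acc,
    }
```

Under this sketch, a positive `mismatch_gap` on probes whose reasoning answers depend entirely on the depicted symbol would reproduce the pattern the paper reports: correct conclusions reached through linguistic priors rather than genuine visual perception.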