

Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

March 19, 2026
Authors: Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Junnan Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, Hai-Tao Zheng, Ying Shen, Liang Lin, Philip S. Yu
cs.AI

Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.