Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols, the fundamental building blocks of human cognition, remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise and deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.
