Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols, the fundamental building blocks of human cognition, remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require more precise and deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed at complex reasoning tasks, suggesting that they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.

