

Toward Cognitive Supersensing in Multimodal Large Language Model

February 2, 2026
Authors: Boyi Li, Yifan Shen, Yuanzhe Liu, Yifan Xu, Jiateng Liu, Xinzhuo Li, Zhengyuan Li, Jingyuan Zhu, Yunhan Zhong, Fangzhou Lan, Jianguo Cao, James M. Rehg, Heng Ji, Ismini Lourentzou, Xu Cao
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent space. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source CogSense-Bench and our model weights.
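
To make the LVIP idea concrete, here is a minimal PyTorch sketch of what such a prediction head could look like based on the abstract alone: a small set of learnable queries cross-attends to the MLLM's hidden states, is projected into a visual latent space, and is aligned with answer-grounded visual features via a cosine loss. The class name `LVIPHead`, the dimensions, the query count, and the loss choice are all illustrative assumptions; the paper's actual head and training objective are not described beyond the abstract.

```python
import torch
import torch.nn as nn


class LVIPHead(nn.Module):
    """Hypothetical sketch of a Latent Visual Imagery Prediction head.

    Given the MLLM's decoder hidden states, it predicts a short sequence
    of latent visual embeddings ("imagery tokens") meant to serve as a
    vision-based internal reasoning chain. All names and dimensions are
    assumptions for illustration, not the paper's implementation.
    """

    def __init__(self, hidden_dim: int = 4096, latent_dim: int = 1024,
                 num_imagery_tokens: int = 8):
        super().__init__()
        # Learnable queries that seed the imagery-token sequence.
        self.queries = nn.Parameter(torch.randn(num_imagery_tokens, hidden_dim))
        # Cross-attend from the imagery queries to the LLM hidden states.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                          batch_first=True)
        # Project attended queries into the visual latent space.
        self.proj = nn.Linear(hidden_dim, latent_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the MLLM decoder.
        b = hidden_states.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.attn(q, hidden_states, hidden_states)
        # Returns (batch, num_imagery_tokens, latent_dim).
        return self.proj(attended)


def imagery_alignment_loss(pred: torch.Tensor,
                           target: torch.Tensor) -> torch.Tensor:
    """Align predicted imagery latents with answer-grounded visual features
    (e.g., embeddings from a frozen vision encoder): 1 - cosine similarity,
    averaged over tokens and the batch. The target source is an assumption."""
    pred = nn.functional.normalize(pred, dim=-1)
    target = nn.functional.normalize(target, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()
```

Under this reading, the alignment loss would be added to the usual next-token objective during supervised training, and the resulting grounded latents would then condition the reinforcement-learning stage that scores text reasoning paths; how the two stages are actually weighted and combined is not specified in the abstract.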