DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
March 17, 2025
Authors: Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
cs.AI
Abstract
Human experts excel at fine-grained visual discrimination by leveraging
domain knowledge to refine perceptual features, a capability that remains
underdeveloped in current Multimodal Large Language Models (MLLMs). Despite
possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning
into visual perception, often generating direct responses without deeper
analysis. To bridge this gap, we introduce knowledge-intensive visual grounding
(KVG), a novel visual grounding task that requires both fine-grained perception
and domain-specific knowledge integration. To address the challenges of KVG, we
propose DeepPerception, an MLLM enhanced with cognitive visual perception
capabilities. Our approach consists of (1) an automated data synthesis pipeline
that generates high-quality, knowledge-aligned training samples, and (2) a
two-stage training framework combining supervised fine-tuning for cognitive
reasoning scaffolding and reinforcement learning to optimize
perception-cognition synergy. To benchmark performance, we introduce KVG-Bench,
a comprehensive dataset spanning 10 domains with 1.3K manually curated test
cases. Experimental results demonstrate that DeepPerception significantly
outperforms direct fine-tuning, achieving a +8.08% accuracy improvement on
KVG-Bench and +4.60% better cross-domain generalization than
baseline approaches. Our findings highlight the importance of integrating
cognitive processes into MLLMs for human-like visual perception and open new
directions for multimodal reasoning research. The data, codes, and models are
released at https://github.com/thunlp/DeepPerception.
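
The headline numbers above are accuracies on KVG-Bench, a box-level visual grounding benchmark. As a rough, self-contained illustration of how such a grounding accuracy might be scored, the sketch below counts a prediction as correct when its IoU with the reference box clears a threshold. The `Box` layout, the `grounding_accuracy` helper, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of scoring a KVG-style grounding benchmark.
# Assumption: a prediction counts as correct when IoU with the
# reference box is >= 0.5; the paper's exact protocol may differ.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions: List[Box], references: List[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of predicted boxes matching the reference above the IoU threshold."""
    assert len(predictions) == len(references)
    hits = sum(iou(p, r) >= threshold for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

# Toy usage: one accurate localization and one miss -> 50% accuracy.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
refs = [(12, 8, 52, 48), (100, 100, 150, 150)]
print(f"accuracy = {grounding_accuracy(preds, refs):.2%}")  # -> 50.00%
```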