DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
March 17, 2025
Authors: Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
cs.AI
Abstract
Human experts excel at fine-grained visual discrimination by leveraging
domain knowledge to refine perceptual features, a capability that remains
underdeveloped in current Multimodal Large Language Models (MLLMs). Despite
possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning
into visual perception, often generating direct responses without deeper
analysis. To bridge this gap, we introduce knowledge-intensive visual grounding
(KVG), a novel visual grounding task that requires both fine-grained perception
and domain-specific knowledge integration. To address the challenges of KVG, we
propose DeepPerception, an MLLM enhanced with cognitive visual perception
capabilities. Our approach consists of (1) an automated data synthesis pipeline
that generates high-quality, knowledge-aligned training samples, and (2) a
two-stage training framework combining supervised fine-tuning for cognitive
reasoning scaffolding and reinforcement learning to optimize
perception-cognition synergy. To benchmark performance, we introduce KVG-Bench,
a comprehensive dataset spanning 10 domains with 1.3K manually curated test
cases. Experimental results demonstrate that DeepPerception significantly
outperforms direct fine-tuning, achieving +8.08% accuracy improvements on
KVG-Bench and exhibiting +4.60% superior cross-domain generalization over
baseline approaches. Our findings highlight the importance of integrating
cognitive processes into MLLMs for human-like visual perception and open new
directions for multimodal reasoning research. The data, codes, and models are
released at https://github.com/thunlp/DeepPerception.
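To make the RL stage of the two-stage framework more concrete, the sketch below is a hypothetical, minimal illustration in Python. It assumes, in line with other R1-style training recipes, that the reinforcement learning stage uses a verifiable reward based on bounding-box IoU against the ground-truth region, combined with GRPO-style group-normalized advantages. The function names (`grounding_reward`, `group_advantages`), the 0.5 IoU threshold, and the overall reward design are illustrative assumptions, not the paper's confirmed implementation.

```python
# Hypothetical sketch of a verifiable grounding reward and GRPO-style
# group advantages for knowledge-intensive visual grounding (KVG).
# Assumptions: responses end with a box formatted as "[x1, y1, x2, y2]";
# the actual DeepPerception reward may differ.

import re
from statistics import mean, pstdev
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def parse_box(response: str) -> Optional[Box]:
    """Extract the first '[x1, y1, x2, y2]' box from a model response."""
    m = re.search(
        r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]",
        response,
    )
    return tuple(map(float, m.groups())) if m else None


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(response: str, gt_box: Box, threshold: float = 0.5) -> float:
    """Verifiable reward: 1 if the predicted box matches the ground truth
    above the IoU threshold, else 0 (unparseable responses get 0)."""
    pred = parse_box(response)
    return 1.0 if pred is not None and iou(pred, gt_box) >= threshold else 0.0


def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: z-score each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


if __name__ == "__main__":
    gt = (100.0, 80.0, 220.0, 200.0)
    sampled_responses = [
        "<think>The grille and badge suggest the older model.</think> [105, 85, 215, 195]",
        "[300, 300, 400, 400]",
        "I cannot locate the object.",
    ]
    rewards = [grounding_reward(r, gt) for r in sampled_responses]
    print(rewards)                   # e.g. [1.0, 0.0, 0.0]
    print(group_advantages(rewards))  # reasoning-backed hit is reinforced
```

Under this reading, the SFT stage would first teach the model to emit knowledge-grounded reasoning before the box (the "cognitive reasoning scaffolding"), and the RL stage would then reinforce whichever sampled reasoning traces actually yield correct localizations, which is one plausible way to realize the perception-cognition synergy the abstract describes.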