DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Abstract

Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework that combines supervised fine-tuning, which scaffolds cognitive reasoning, with reinforcement learning, which optimizes perception-cognition synergy. To benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving a +8.08% accuracy improvement on KVG-Bench and +4.60% better cross-domain generalization than baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, code, and models are released at https://github.com/thunlp/DeepPerception.
