Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception

September 21, 2025
Authors: Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically face a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that exploit the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns more precise localization. The RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into the LLaVA-1.5 architecture. Despite being trained on only a few (e.g., 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full-model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
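The abstract describes two technical pieces: (1) converting noisy middle-layer attention maps into pseudo-RoI labels by denoising and resolving ambiguity, and (2) a lightweight RPN head over frozen middle-layer features that scores RoIs in a single forward pass. The PyTorch sketch below illustrates the general shape of such a pipeline; all names (`pseudo_roi_labels`, `LightweightRPN`, `rpn_loss`), thresholds, and tensor shapes are our own assumptions, not the paper's actual denoising rules or architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pseudo_roi_labels(attn, pos_ratio=0.1, neg_thresh=0.2):
    """Turn a noisy text-to-image attention map into pseudo-RoI labels.

    attn: [heads, num_text_tokens, num_patches] attention from an MLLM
    middle layer (shapes hypothetical). Denoise by averaging over heads
    and text tokens, then label the top pos_ratio of patches positive,
    clearly low-scoring patches negative, and the ambiguous middle band
    -1 so the loss can ignore it.
    """
    score = attn.mean(dim=(0, 1))                      # [num_patches]
    score = (score - score.min()) / (score.max() - score.min() + 1e-6)
    k = max(1, int(pos_ratio * score.numel()))
    pos_thresh = score.topk(k).values.min()
    labels = torch.full_like(score, -1.0)              # -1 = ambiguous, ignored
    labels[score < neg_thresh] = 0.0                   # confident background
    labels[score >= pos_thresh] = 1.0                  # confident RoI patches
    return labels


class LightweightRPN(nn.Module):
    """Tiny head over frozen middle-layer patch features.

    Scores every patch in one forward pass, so RoI selection does not
    depend on auto-regressive decoding.
    """

    def __init__(self, hidden_dim, rpn_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, rpn_dim),
            nn.GELU(),
            nn.Linear(rpn_dim, 1),
        )

    def forward(self, patch_feats):                    # [B, num_patches, hidden_dim]
        return self.net(patch_feats).squeeze(-1)       # [B, num_patches] logits


def rpn_loss(logits, labels):
    # Binary cross-entropy over confident patches only; ambiguous
    # patches (label == -1) are masked out of the objective.
    mask = labels >= 0
    return F.binary_cross_entropy_with_logits(logits[mask], labels[mask])


# Toy usage with hypothetical sizes: 16 heads, 32 text tokens,
# 576 image patches, 4096-dim hidden states (LLaVA-1.5-like).
attn = torch.rand(16, 32, 576)
labels = pseudo_roi_labels(attn)
feats = torch.randn(1, 576, 4096)
rpn = LightweightRPN(hidden_dim=4096)
loss = rpn_loss(rpn(feats)[0], labels)
```

The three-way labels (positive, negative, ignore) mirror the abstract's "resolving ambiguity" step: patches with borderline attention are masked out of the loss rather than forced into either class, which is one common way to keep noisy self-generated supervision from corrupting training.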