Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
September 21, 2025
Authors: Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) require high-resolution visual
information to perform fine-grained perception, yet processing entire
high-resolution images is computationally prohibitive. While recent methods
leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they
typically present a difficult trade-off: training-based approaches depend on
large-scale annotated datasets, while training-free methods that utilize the
model's internal attention are computationally inefficient and less accurate,
requiring either multi-pass prefill stages or reliance on the slow
auto-regressive decoding process. In this paper, we propose an efficient,
annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves
this trade-off. The SD-RPN is built around a pipeline that transforms the noisy
attention maps from the MLLM's middle layers into high-quality pseudo-RoI
labels by explicitly denoising the signal and resolving ambiguity. We use these
labels to train a lightweight Region Proposal Network (RPN) that learns more
precise localization. This RPN is also highly efficient, predicting the RoI in
a single forward pass using features from the MLLM's middle layers, decoupling
RoI identification from the auto-regressive generation and avoiding costly
multi-pass operations.To validate our approach, we integrate the framework into
the LLaVA-1.5 architecture. Despite being trained on only a few (e.g. 10K)
question-answer pairs, our method demonstrates exceptional data efficiency and
generalization, achieving over a 10% absolute accuracy improvement on unseen
benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a
practical and scalable solution for enhancing the fine-grained perception of
MLLMs without requiring costly supervision or full model fine-tuning. Code is
available at https://github.com/YuHengsss/SD-RPN.
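To make the two-stage recipe in the abstract concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the helper names (`pseudo_roi_labels`, `RoIHead`), the tensor shapes, and the specific denoising rule (layer averaging, local smoothing, per-image top-k thresholding) are all illustrative assumptions standing in for the paper's actual pseudo-label pipeline. What it does show is the core self-distillation pattern: the frozen MLLM supplies both the mid-layer features and, via its attention maps, the pseudo-RoI targets, while only a small RPN-style head is trained and later predicts the RoI in a single forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_roi_labels(attn_maps: torch.Tensor, keep_ratio: float = 0.1,
                      kernel: int = 3) -> torch.Tensor:
    """Turn noisy mid-layer attention maps into binary pseudo-RoI labels.

    attn_maps: (B, L, H, W) text-to-image attention collected from several
    middle layers. Averaging over layers plus local smoothing is one plausible
    way to "denoise the signal"; keeping only the strongest responses via a
    per-image threshold is one way to "resolve ambiguity". Both choices are
    assumptions, not the paper's exact rule.
    """
    m = attn_maps.mean(dim=1, keepdim=True)                      # (B, 1, H, W)
    m = F.avg_pool2d(m, kernel, stride=1, padding=kernel // 2)   # smooth noise
    flat = m.flatten(1)
    k = max(1, int(keep_ratio * flat.shape[1]))
    thresh = flat.topk(k, dim=1).values[:, -1:]                  # k-th largest
    return (flat >= thresh).float().view_as(m)                   # binary labels

class RoIHead(nn.Module):
    """Lightweight RPN-style head scoring each patch in one forward pass."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(dim, dim // 4, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // 4, 1, 1),               # per-patch RoI logit
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.score(feats)                      # (B, 1, H, W) logits

# Self-distillation step: the MLLM stays frozen; only the head is trained
# on its own attention-derived pseudo labels, so no annotations are needed.
B, C, H, W = 2, 1024, 24, 24
feats = torch.randn(B, C, H, W)           # stand-in for mid-layer features
attn = torch.rand(B, 8, H, W)             # stand-in for noisy attention maps
head = RoIHead(C)
loss = F.binary_cross_entropy_with_logits(head(feats), pseudo_roi_labels(attn))
loss.backward()
```

At inference time, under these same assumptions, the head's logits would be thresholded into an RoI mask before the auto-regressive decoding begins, which is what decouples RoI identification from generation and avoids multi-pass prefill.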