FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models
March 9, 2026
Authors: Haoyang Li, Liang Wang, Siyu Zhou, Jiacheng Sun, Jing Jiang, Chao Wang, Guodong Long, Yan Peng
cs.AI
Abstract
CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during tuning. In this paper, we attribute the failure modes of prompt-tuned predictions to shifts in the foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate these shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance foreground-view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate the generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets demonstrate the effectiveness and compatibility of FVG-PT. Code is available at: https://github.com/JREion/FVG-PT
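The gating and distillation ideas sketched in the abstract can be illustrated with a minimal toy example. This is not the authors' implementation; the function names, the sigmoid scalar gate, and the MSE distillation objective are all illustrative assumptions about how a reliability gate might blend a foreground view with the encoder's attention map:

```python
import numpy as np

def sigmoid(x):
    # numerically plain logistic; fine for a scalar toy example
    return 1.0 / (1.0 + np.exp(-x))

def gated_foreground_blend(attn, fg_mask, gate_logit):
    """Blend a visual attention map with a foreground view.

    attn       : (H, W) attention map from the visual encoder
    fg_mask    : (H, W) soft foreground mask (the "foreground view")
    gate_logit : scalar learnable parameter; sigmoid(gate_logit) acts as a
                 reliability gate deciding how much to trust the mask
                 (hypothetical stand-in for the paper's Foreground
                 Reliability Gate)
    """
    g = sigmoid(gate_logit)
    return g * fg_mask + (1.0 - g) * attn

def foreground_distillation_loss(attn, fg_mask):
    """MSE pulling attention toward the foreground region
    (an assumed objective for the distillation-compensation idea)."""
    return float(np.mean((attn - fg_mask) ** 2))

# toy 2x2 attention map and foreground mask
attn = np.array([[0.2, 0.8], [0.1, 0.9]])
mask = np.array([[0.0, 1.0], [0.0, 1.0]])

blended = gated_foreground_blend(attn, mask, gate_logit=2.0)
loss = foreground_distillation_loss(attn, mask)
```

With `gate_logit = 2.0` the gate opens to roughly 0.88, so the blended map leans heavily on the foreground mask; a learned gate would instead drive this value up or down per sample depending on how reliable the extracted foreground view is.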