FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models

Abstract

CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to adapt efficiently to downstream tasks. Although existing studies have made significant progress, they pay limited attention to how the internal attention representations of VLMs change during tuning. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in the foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate these shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate the generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets demonstrate the effectiveness and compatibility of FVG-PT. Code is available at https://github.com/JREion/FVG-PT
