Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Abstract

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, is inconsistent and does not always improve monotonically as the number of examples increases. We hypothesize that this occurs because the LMM is overwhelmed by additional information in the image embeddings that is not required for the downstream task. To address this, we propose a meta-learning approach that offers an alternative way to induce few-shot capabilities in LMMs: a fixed set of soft prompts that are distilled from task-relevant image features and can be adapted at test time using a few examples. To facilitate this distillation, we introduce an attention-mapper module that can be easily integrated with the popular LLaVA v1.5 architecture and is jointly learned with the soft prompts, enabling task adaptation in LMMs under low-data regimes with just a few gradient steps. Evaluation on the VL-ICL Bench shows that our method consistently outperforms ICL and related prompt-tuning approaches, even under image perturbations, improving task induction and reasoning across visual question answering tasks.
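To make the distillation idea concrete, the following is a minimal sketch of how a fixed set of soft prompts could pool information from image tokens through cross-attention. It is an illustration under stated assumptions, not the attention-mapper described in the paper: the function names, the toy dimensions, and the use of plain scaled dot-product attention without learned projections are all hypothetical, and the meta-learning and test-time gradient steps are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(soft_prompts, image_tokens):
    """Hypothetical mapper: each soft prompt acts as a query that pools the
    image tokens (keys and values) with scaled dot-product attention.
    Learned projections are left out to keep the sketch short."""
    dim = len(image_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    distilled = []
    for query in soft_prompts:
        # Similarity of this prompt to every image token.
        scores = [scale * sum(q * k for q, k in zip(query, token))
                  for token in image_tokens]
        weights = softmax(scores)
        # Weighted average of image tokens: one distilled vector per prompt.
        pooled = [sum(w * token[d] for w, token in zip(weights, image_tokens))
                  for d in range(dim)]
        distilled.append(pooled)
    return distilled


# Toy sizes (assumed for illustration): 2 soft prompts, 4 image tokens, dim 3.
soft_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_tokens = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0],
                [0.0, 0.0, 5.0], [1.0, 1.0, 1.0]]
distilled = attention_map(soft_prompts, image_tokens)
```

The number of output vectors equals the number of soft prompts regardless of how many image tokens are supplied, which is the property that lets a fixed-size prompt set stand in for a variable amount of image information. In a trainable version, the prompts and projection weights would be parameters updated by gradient steps on the few available examples.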
