Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

June 7, 2025
Authors: Akash Gupta, Amos Storkey, Mirella Lapata
cs.AI

Abstract

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, is inconsistent and does not always improve monotonically with increasing examples. We hypothesize that this occurs due to the LMM being overwhelmed by additional information present in the image embeddings, which is not required for the downstream task. To address this, we propose a meta-learning approach that provides an alternative for inducing few-shot capabilities in LMMs, using a fixed set of soft prompts that are distilled from task-relevant image features and can be adapted at test time using a few examples. To facilitate this distillation, we introduce an attention-mapper module that can be easily integrated with the popular LLaVA v1.5 architecture and is jointly learned with soft prompts, enabling task adaptation in LMMs under low-data regimes with just a few gradient steps. Evaluation on the VL-ICL Bench shows that our method consistently outperforms ICL and related prompt-tuning approaches, even under image perturbations, improving task induction and reasoning across visual question answering tasks.
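The abstract outlines the key mechanism: a fixed set of soft prompts cross-attends to image features through an attention-mapper module, the two are jointly meta-learned, and only they are updated at test time with a few gradient steps while the LMM stays frozen. Below is a minimal sketch of how such a module and adaptation loop might look in PyTorch; the class and function names, dimensions, prompt count, and loss hook (`AttentionMapper`, `adapt`, `lmm_loss_fn`) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the paper's code) of distilling task-relevant image
# features into a fixed set of soft prompts via cross-attention. All names,
# sizes, and the adaptation loop below are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionMapper(nn.Module):
    """Cross-attends learnable soft prompts over image features, producing a
    compact prompt sequence for the frozen language model."""

    def __init__(self, num_prompts: int = 16, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Fixed set of soft prompts, meta-learned across tasks.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim), e.g. vision-encoder patch
        # embeddings projected into the LLM embedding space as in LLaVA v1.5.
        b = image_feats.size(0)
        queries = self.soft_prompts.unsqueeze(0).expand(b, -1, -1)
        distilled, _ = self.cross_attn(queries, image_feats, image_feats)
        return self.norm(distilled + queries)  # (batch, num_prompts, dim)

def adapt(mapper, lmm_loss_fn, support_batch, steps: int = 5, lr: float = 1e-3):
    """Hypothetical test-time adaptation: a few gradient steps on the support
    examples, updating only the mapper and soft prompts (LMM frozen)."""
    opt = torch.optim.SGD(mapper.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        prompts = mapper(support_batch["image_feats"])
        loss = lmm_loss_fn(prompts, support_batch)  # frozen LMM scores answers
        loss.backward()
        opt.step()
```

Freezing the backbone and updating only the mapper and prompts is what would make adaptation feasible in the low-data regime the abstract describes: the trainable parameter count is tiny relative to the LMM, so a handful of gradient steps on a few examples can suffice without overfitting.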