

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

June 21, 2024
作者: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig
cs.AI

Abstract

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
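To make the mechanism in the abstract concrete, below is a minimal, hypothetical sketch of the underlying task-vector idea: average attention-layer activations collected from many-shot prompts, then patch those averages into a zero-shot forward pass. This is not the authors' released implementation. The Llama-style module path (`model.model.layers[i].self_attn`), the layer-level (rather than per-head) granularity, and patching only the final prompt position are all simplifying assumptions for illustration.

```python
# Hypothetical sketch of extracting and reusing task vectors, assuming a
# HuggingFace Llama-style model. The paper selects specific attention heads
# and compresses multimodal examples; here we only illustrate the mechanism.

import torch

def extract_task_vectors(model, shot_batches, layer_ids):
    """Mean last-token attention outputs over many-shot example prompts."""
    sums = {lid: None for lid in layer_ids}
    count = 0
    handles = []

    def make_hook(lid):
        def hook(module, inputs, output):
            attn_out = output[0] if isinstance(output, tuple) else output
            act = attn_out[:, -1, :].detach()  # (batch, hidden)
            sums[lid] = act.sum(0) if sums[lid] is None else sums[lid] + act.sum(0)
        return hook

    for lid in layer_ids:
        handles.append(
            model.model.layers[lid].self_attn.register_forward_hook(make_hook(lid))
        )
    with torch.no_grad():
        for batch in shot_batches:  # each batch: dict of input tensors
            model(**batch)
            count += batch["input_ids"].shape[0]
    for h in handles:
        h.remove()
    return {lid: s / count for lid, s in sums.items()}

def generate_with_task_vectors(model, query_inputs, task_vectors, **gen_kwargs):
    """Zero-shot generation with the averaged activations patched in, so the
    compressed shots cost no extra context tokens at inference."""
    handles = []

    def make_patch(lid):
        def patch(module, inputs, output):
            attn_out = output[0] if isinstance(output, tuple) else output
            # Overwrite the last position in place (crude: with KV caching
            # this re-applies the vector at every generation step).
            attn_out[:, -1, :] = task_vectors[lid].to(attn_out.dtype)
        return patch

    for lid in task_vectors:
        handles.append(
            model.model.layers[lid].self_attn.register_forward_hook(make_patch(lid))
        )
    try:
        with torch.no_grad():
            return model.generate(**query_inputs, **gen_kwargs)
    finally:
        for h in handles:
            h.remove()
```

Because the compressed examples live in patched activations rather than in the prompt, the inference-time context window holds only the query itself, which is what lets the number of "shots" scale past the pretraining context length.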