GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

October 8, 2024
作者: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass
cs.AI

Abstract

In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of text prompts preferred by the downstream VLM. Furthermore, we also explicitly steer the LLM generation process in each optimization step by specifically adding an offset difference vector of the embeddings from the positive and negative solutions found by the LLM, in previous optimization steps, to the intermediate layer of the network for the next generation step. This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models -- showing that the discovered solutions can enhance the recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these models.
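
To make the procedure described above concrete, below is a minimal, hypothetical sketch of a GLOV-style optimization cycle, not the authors' released implementation. It assumes a CLIP zero-shot classifier as the downstream VLM, a small labelled few-shot split for the fitness function, and an abstract `llm_generate` callable standing in for the meta-prompted LLM; `make_steering_hook` only illustrates the positive-minus-negative embedding offset applied to a hypothetical intermediate decoder layer.

```python
# Hypothetical sketch of a GLOV-style optimization loop; model names, helper
# names, and hyperparameters are illustrative, not the authors' implementation.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def fitness(template, images, labels, class_names):
    """Fitness = zero-shot CLIP accuracy when class names are wrapped in `template`."""
    texts = [template.format(c) for c in class_names]
    inputs = proc(text=texts, images=images, return_tensors="pt", padding=True)
    preds = clip(**inputs).logits_per_image.argmax(dim=-1)
    return (preds == torch.tensor(labels)).float().mean().item()

def build_meta_prompt(task_desc, ranked):
    """In-context feedback: previously tried templates with their accuracies."""
    history = "\n".join(f"prompt: {p!r}  accuracy: {a:.3f}" for p, a in ranked)
    return (f"Task: {task_desc}\n"
            f"Templates tried so far, worst to best:\n{history}\n"
            "Propose one new template containing '{}' for the class name.")

def glov_loop(llm_generate, task_desc, images, labels, class_names, steps=10):
    """llm_generate: any callable returning candidate templates for a meta-prompt."""
    ranked = [("a photo of a {}.",
               fitness("a photo of a {}.", images, labels, class_names))]
    for _ in range(steps):
        meta = build_meta_prompt(task_desc, sorted(ranked, key=lambda x: x[1]))
        for cand in llm_generate(meta):
            ranked.append((cand, fitness(cand, images, labels, class_names)))
    return max(ranked, key=lambda x: x[1])   # best template and its accuracy

def make_steering_hook(pos_embs, neg_embs, alpha=1.0):
    """Offset vector (mean of good-solution embeddings minus mean of bad ones),
    added to an intermediate layer's hidden states to steer the next generation."""
    offset = alpha * (pos_embs.mean(dim=0) - neg_embs.mean(dim=0))
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + offset                 # broadcast over batch and tokens
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook
# e.g. handle = llm.model.layers[k].register_forward_hook(make_steering_hook(pos, neg))
```

In such a sketch, the positive and negative embedding sets would be taken from the highest- and lowest-ranked prompts of previous optimization steps, and the layer index `k` and scale `alpha` would need tuning per LLM.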
