GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
October 8, 2024
作者: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass
cs.AI
Abstract
In this work, we propose a novel method (GLOV) enabling Large Language Models
(LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to
enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the
downstream task description, querying it for suitable VLM prompts (e.g., for
zero-shot classification with CLIP). These prompts are ranked according to a
purity measure obtained through a fitness function. In each respective
optimization step, the ranked prompts are fed as in-context examples (with
their accuracies) to equip the LLM with the knowledge of the type of text
prompts preferred by the downstream VLM. Furthermore, we explicitly steer the
LLM generation process in each optimization step by adding an offset vector,
computed as the difference between the embeddings of the positive and negative
solutions found by the LLM in previous optimization steps, to an intermediate
layer of the network for the next generation step. This offset vector steers
the LLM generation toward the type of language preferred by the downstream VLM,
resulting in enhanced performance on the downstream vision tasks. We
comprehensively evaluate our GLOV on 16 diverse datasets using two families of
VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models
-- showing that the discovered solutions can enhance the recognition
performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these
models.
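
To make the loop described in the abstract concrete, below is a minimal, illustrative Python sketch. The LLM call (`propose_prompts`), the fitness evaluation (`fitness`), and the embedding function (`embed`) are hypothetical stubs standing in for the real components (a meta-prompted LLM and, e.g., CLIP zero-shot accuracy); only the overall structure, proposing prompts, ranking them by fitness, feeding the ranked prompts back as in-context examples, and computing a positive-minus-negative offset vector, follows the abstract, and all names and values here are assumptions, not the authors' implementation.

```python
# Minimal sketch of a GLOV-style optimization loop with stubbed LLM/VLM calls.
import numpy as np

rng = np.random.default_rng(0)

def propose_prompts(task_desc, ranked_history, n=5):
    # Stand-in for meta-prompting the LLM with the task description plus the
    # previously ranked (prompt, accuracy) pairs as in-context examples.
    return [f"a photo of a {{}}, candidate {rng.integers(1000)}" for _ in range(n)]

def fitness(prompt):
    # Stand-in for the fitness function, e.g., accuracy of CLIP zero-shot
    # classification on held-out samples when using this prompt template.
    return float(rng.random())

def embed(prompt):
    # Stand-in for an intermediate-layer LLM embedding of a prompt.
    return rng.normal(size=16)

def glov_step(task_desc, history):
    candidates = propose_prompts(task_desc, history)
    ranked = sorted(candidates, key=fitness, reverse=True)
    # Difference between the best ("positive") and worst ("negative") solution
    # embeddings; in GLOV a scaled offset of this kind is added to an
    # intermediate LLM layer to steer the next generation step (not shown here).
    offset = embed(ranked[0]) - embed(ranked[-1])
    return ranked, offset

history, offset = [], None
for _ in range(3):
    history, offset = glov_step("fine-grained bird classification", history)
print(history[0], np.linalg.norm(offset))
```

In this sketch the feedback signal is purely the ranked prompt list; in the actual method the accuracies accompany the prompts in the meta-prompt, and the offset vector is applied inside the LLM during generation rather than merely computed.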