GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Abstract

In this work, we propose a novel method (GLOV) that enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) in order to enhance downstream vision tasks. GLOV meta-prompts an LLM with the downstream task description and queries it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each optimization step, the ranked prompts are fed back as in-context examples (with their accuracies) so that the LLM learns which type of text prompts the downstream VLM prefers. Furthermore, we explicitly steer the LLM generation process in each optimization step: an offset difference vector, computed from the embeddings of the positive and negative solutions found by the LLM in previous optimization steps, is added to an intermediate layer of the network for the next generation step. This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate GLOV on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models, showing that the discovered solutions can enhance the recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these models, respectively.
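To make the optimization step described in the abstract more concrete, the following is a minimal sketch of its three ingredients: ranking candidate prompts with a fitness function, assembling a meta-prompt from the ranked prompts and their accuracies, and adding an offset difference vector to an intermediate hidden state. It is an illustration under stated assumptions rather than the authors' implementation. All function names, the `top_k` cutoff, and the scaling factor `alpha` are hypothetical, and the abstract does not specify how the embeddings are computed or which layer is steered.

```python
import numpy as np


def rank_prompts(prompts, fitness):
    """Score each candidate prompt with a fitness function and sort best-first.

    `fitness` is a hypothetical callable returning a score per prompt,
    e.g. accuracy of the downstream VLM on a small labelled set.
    """
    scored = [(p, fitness(p)) for p in prompts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def build_meta_prompt(task_description, ranked, top_k=3):
    """Assemble a meta-prompt with the top-k ranked prompts as in-context examples.

    The wording of the meta-prompt and the top_k cutoff are assumptions.
    """
    lines = [f"Task: {task_description}", "Previous prompts and their accuracies:"]
    for prompt, acc in ranked[:top_k]:
        lines.append(f"- {prompt!r}: {acc:.1f}%")
    lines.append("Propose a new prompt that should score higher.")
    return "\n".join(lines)


def steering_offset(pos_embedding, neg_embedding):
    """Offset difference vector: positive-solution minus negative-solution embedding."""
    return np.asarray(pos_embedding) - np.asarray(neg_embedding)


def steer_hidden_state(hidden, offset, alpha=1.0):
    """Add the offset to every position of an intermediate hidden state.

    `hidden` has shape (seq_len, d) and `offset` has shape (d,).
    The scaling factor `alpha` is an assumption, not taken from the abstract.
    """
    return np.asarray(hidden) + alpha * offset
```

In an actual LLM the addition in `steer_hidden_state` would be applied to the activations of a chosen layer during generation; here plain NumPy arrays stand in for those activations so that the arithmetic can be checked in isolation.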
