Discriminative Fine-tuning of LVLMs

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size on standard image-text retrieval benchmarks, along with notable gains in compositionality.
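As a rough illustration of what training with both contrastive and next-token prediction losses can look like, the sketch below combines a symmetric InfoNCE-style contrastive term with a next-token cross-entropy term. The abstract does not specify the exact loss formulation, how embeddings are read out of the LVLM, or how the two terms are weighted, so everything here is an assumption: the function names, the temperature of 0.07, and the `ntp_weight` parameter are hypothetical and chosen only for illustration.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit L2 norm so that dot products are cosine similarities.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch; pair i is the positive for row i.

    The temperature value is an assumed default, not taken from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each image should pick its own text.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image direction: transpose and repeat.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def next_token_loss(logits, targets):
    """Mean cross-entropy of the target token ids given per-position logits."""
    return -sum(log_softmax(l)[t] for l, t in zip(logits, targets)) / len(targets)

def combined_loss(image_embs, text_embs, logits, targets, ntp_weight=1.0):
    # Hypothetical weighting: a plain weighted sum of the two terms.
    return (contrastive_loss(image_embs, text_embs)
            + ntp_weight * next_token_loss(logits, targets))
```

In this toy form, matched image-text pairs drive the contrastive term toward zero, while the next-token term keeps the language-modelling objective in play. In an actual setup the embeddings and logits would come from the adapted LVLM rather than from hand-written lists.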
